[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-EleutherAI--lm-evaluation-harness":3,"tool-EleutherAI--lm-evaluation-harness":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":84,"stars":97,"forks":98,"last_commit_at":99,"license":100,"difficulty_score":10,"env_os":101,"env_gpu":102,"env_ram":103,"env_deps":104,"category_tags":114,"github_topics":115,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":119,"updated_at":120,"faqs":121,"releases":150},2058,"EleutherAI\u002Flm-evaluation-harness","lm-evaluation-harness","A framework for few-shot evaluation of language models.","lm-evaluation-harness 是一个专为大语言模型设计的开源评估框架，旨在通过统一的标准对模型进行少样本（few-shot）能力测试。它主要解决了当前大模型评估中基准分散、环境配置复杂以及结果难以复现的痛点，让研究人员能够在一个平台上轻松对比不同模型在数十种学术任务上的表现。\n\n这款工具非常适合 AI 研究人员、算法工程师以及希望严谨测试模型性能的开发团队使用。无论是验证新训练的模型，还是复现论文中的实验数据，lm-evaluation-harness 都能提供可靠的支持。\n\n其技术亮点十分突出：内置了超过 60 种标准学术基准及数百个子任务；支持多种主流推理后端，包括 Hugging Face Transformers（含量化）、vLLM、SGLang 等，兼顾了灵活性与推理速度；近期更新还引入了基于 YAML 的配置化管理、思维链（CoT）痕迹剥离功能，并初步支持多模态任务评估。此外，它还集成了最新的 Open LLM Leaderboard 任务集，帮助用户紧跟行业评测标准。作为一个社区驱动的项目，lm-evaluation-harness 以模块化设计和丰富的文档","lm-evaluation-harness 是一个专为大语言模型设计的开源评估框架，旨在通过统一的标准对模型进行少样本（few-shot）能力测试。它主要解决了当前大模型评估中基准分散、环境配置复杂以及结果难以复现的痛点，让研究人员能够在一个平台上轻松对比不同模型在数十种学术任务上的表现。\n\n这款工具非常适合 AI 研究人员、算法工程师以及希望严谨测试模型性能的开发团队使用。无论是验证新训练的模型，还是复现论文中的实验数据，lm-evaluation-harness 都能提供可靠的支持。\n\n其技术亮点十分突出：内置了超过 60 种标准学术基准及数百个子任务；支持多种主流推理后端，包括 Hugging Face Transformers（含量化）、vLLM、SGLang 等，兼顾了灵活性与推理速度；近期更新还引入了基于 YAML 的配置化管理、思维链（CoT）痕迹剥离功能，并初步支持多模态任务评估。此外，它还集成了最新的 Open LLM Leaderboard 任务集，帮助用户紧跟行业评测标准。作为一个社区驱动的项目，lm-evaluation-harness 以模块化设计和丰富的文档，成为了大模型领域不可或缺的“标尺”。","# Language Model Evaluation Harness\n\n[![DOI](https:\u002F\u002Fzenodo.org\u002Fbadge\u002FDOI\u002F10.5281\u002Fzenodo.10256836.svg)](https:\u002F\u002Fdoi.org\u002F10.5281\u002Fzenodo.10256836)\n\n---\n\n## Latest News 📣\n- [2025\u002F12] **CLI refactored** with subcommands (`run`, `ls`, `validate`) and YAML config file support via `--config`. See the [CLI Reference](.\u002Fdocs\u002Finterface.md) and [Configuration Guide](.\u002Fdocs\u002Fconfig_files.md).\n- [2025\u002F12] **Lighter install**: Base package no longer includes `transformers`\u002F`torch`. 
Install model backends separately: `pip install lm_eval[hf]`, `lm_eval[vllm]`, etc.\n- [2025\u002F07] Added `think_end_token` arg to `hf` (token\u002Fstr), `vllm` and `sglang` (str) for stripping CoT reasoning traces from models that support it.\n- [2025\u002F03] Added support for steering HF models!\n- [2025\u002F02] Added [SGLang](https:\u002F\u002Fdocs.sglang.ai\u002F) support!\n- [2024\u002F09] We are prototyping allowing users of LM Evaluation Harness to create and evaluate on text+image multimodal input, text output tasks, and have just added the `hf-multimodal` and `vllm-vlm` model types and `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it for themselves, and suggest they check out [`lmms-eval`](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval), a wonderful project originally forking off of the lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.\n- [2024\u002F07] [API model](docs\u002FAPI_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. **To run Llama 405B, we recommend using VLLM's OpenAI-compliant API to host the model, and use the `local-completions` model type to evaluate the model.**\n- [2024\u002F07] New Open LLM Leaderboard tasks have been added ! You can find them under the [leaderboard](lm_eval\u002Ftasks\u002Fleaderboard\u002FREADME.md) task group.\n\n---\n\n## Announcement\n\n**A new v0.4.0 release of lm-evaluation-harness is available** !\n\nNew updates and features include:\n\n- **New Open LLM Leaderboard tasks have been added ! You can find them under the [leaderboard](lm_eval\u002Ftasks\u002Fleaderboard\u002FREADME.md) task group.**\n- Internal refactoring\n- Config-based task creation and configuration\n- Easier import and sharing of externally-defined task config YAMLs\n- Support for Jinja2 prompt design, easy modification of prompts + prompt imports from Promptsource\n- More advanced configuration options, including output post-processing, answer extraction, and multiple LM generations per document, configurable fewshot settings, and more\n- Speedups and new modeling libraries supported, including: faster data-parallel HF model usage, vLLM support, MPS support with HuggingFace, and more\n- Logging and usability changes\n- New tasks including CoT BIG-Bench-Hard, Belebele, user-defined task groupings, and more\n\nPlease see our updated documentation pages in `docs\u002F` for more details.\n\nDevelopment will be continuing on the `main` branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub, or in the [EleutherAI discord](https:\u002F\u002Fdiscord.gg\u002Feleutherai)!\n\n---\n\n## Overview\n\nThis project provides a unified framework to test generative language models on a large number of different evaluation tasks.\n\n**Features:**\n\n- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.\n- Support for models loaded via [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002F) (including quantization via [GPTQModel](https:\u002F\u002Fgithub.com\u002FModelCloud\u002FGPTQModel) and [AutoGPTQ](https:\u002F\u002Fgithub.com\u002FPanQiWei\u002FAutoGPTQ)), [GPT-NeoX](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox), and 
[Megatron-DeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMegatron-DeepSpeed\u002F), with a flexible tokenization-agnostic interface.\n- Support for fast and memory-efficient inference with [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm).\n- Support for commercial APIs including [OpenAI](https:\u002F\u002Fopenai.com), and [TextSynth](https:\u002F\u002Ftextsynth.com\u002F).\n- Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft).\n- Support for local models and benchmarks.\n- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.\n- Easy support for custom prompts and evaluation metrics.\n\nThe Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FHuggingFaceH4\u002Fopen_llm_leaderboard), has been used in [hundreds of papers](https:\u002F\u002Fscholar.google.com\u002Fscholar?oi=bibs&hl=en&authuser=2&cites=15052937328817631261,4097184744846514103,1520777361382155671,17476825572045927382,18443729326628441434,14801318227356878622,7890865700763267262,12854182577605049984,15641002901115500560,5104500764547628290), and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.\n\n## Install\n\nTo install the `lm-eval` package from the github repository, run:\n\n```bash\ngit clone --depth 1 https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\ncd lm-evaluation-harness\npip install -e .\n```\n\n### Installing Model Backends\n\nThe base installation provides the core evaluation framework. **Model backends must be installed separately** using optional extras:\n\nFor HuggingFace transformers models:\n\n```bash\npip install \"lm_eval[hf]\"\n```\n\nFor vLLM inference:\n\n```bash\npip install \"lm_eval[vllm]\"\n```\n\nFor API-based models (OpenAI, Anthropic, etc.):\n\n```bash\npip install \"lm_eval[api]\"\n```\n\nMultiple backends can be installed together:\n\n```bash\npip install \"lm_eval[hf,vllm,api]\"\n```\n\nA detailed table of all optional extras is available at the end of this document.\n\n## Basic Usage\n\n### Documentation\n\n| Guide | Description |\n|-------|-------------|\n| [CLI Reference](.\u002Fdocs\u002Finterface.md) | Command-line arguments and subcommands |\n| [Configuration Guide](.\u002Fdocs\u002Fconfig_files.md) | YAML config file format and examples |\n| [Python API](.\u002Fdocs\u002Fpython-api.md) | Programmatic usage with `simple_evaluate()` |\n| [Task Guide](.\u002Flm_eval\u002Ftasks\u002FREADME.md) | Available tasks and task configuration |\n\nUse `lm-eval -h` to see available options, or `lm-eval run -h` for evaluation options.\n\nList available tasks with:\n\n```bash\nlm-eval ls tasks\n```\n\n### Hugging Face `transformers`\n\n> [!Important]\n> To use the HuggingFace backend, first install: `pip install \"lm_eval[hf]\"`\n\nTo evaluate a model hosted on the [HuggingFace Hub](https:\u002F\u002Fhuggingface.co\u002Fmodels) (e.g. GPT-J-6B) on `hellaswag` you can use the following command (this assumes you are using a CUDA-compatible GPU):\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI\u002Fgpt-j-6B \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\nAdditional arguments can be provided to the model constructor using the `--model_args` flag. 
Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI\u002Fpythia-160m,revision=step100000,dtype=\"float\" \\\n    --tasks lambada_openai,hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\nModels that are loaded via both `transformers.AutoModelForCausalLM` (autoregressive, decoder-only GPT style models) and `transformers.AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) in Huggingface are supported.\n\nBatch size selection can be automated by setting the  ```--batch_size``` flag to ```auto```. This will perform automatic detection of the largest batch size that will fit on your device. On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size, to gain a further speedup. To do this, append ```:N``` to above flag to automatically recompute the largest batch size ```N``` times. For example, to recompute the batch size 4 times, the command would be:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI\u002Fpythia-160m,revision=step100000,dtype=\"float\" \\\n    --tasks lambada_openai,hellaswag \\\n    --device cuda:0 \\\n    --batch_size auto:4\n```\n\n> [!Note]\n> Just like you can provide a local path to `transformers.AutoModel`, you can also provide a local path to `lm_eval` via `--model_args pretrained=\u002Fpath\u002Fto\u002Fmodel`\n\n#### Evaluating GGUF Models\n\n`lm-eval` supports evaluating models in GGUF format using the Hugging Face (`hf`) backend. This allows you to use quantized models compatible with `transformers`, `AutoModel`, and llama.cpp conversions.\n\nTo evaluate a GGUF model, pass the path to the directory containing the model weights, the `gguf_file`, and optionally a separate `tokenizer` path using the `--model_args` flag.\n\n**🚨 Important Note:**  \nIf no separate tokenizer is provided, Hugging Face will attempt to reconstruct the tokenizer from the GGUF file — this can take **hours** or even hang indefinitely. 
Passing a separate tokenizer avoids this issue and can reduce tokenizer loading time from hours to seconds.\n\n**✅ Recommended usage:**\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=\u002Fpath\u002Fto\u002Fgguf_folder,gguf_file=model-name.gguf,tokenizer=\u002Fpath\u002Fto\u002Ftokenizer \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\n> [!Tip]\n> Ensure the tokenizer path points to a valid Hugging Face tokenizer directory (e.g., containing tokenizer_config.json, vocab.json, etc.).\n\n#### Multi-GPU Evaluation with Hugging Face `accelerate`\n\nWe support three main ways of using Hugging Face's [accelerate 🚀](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) library for multi-GPU evaluation.\n\nTo perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:\n\n```bash\naccelerate launch -m lm_eval --model hf \\\n    --tasks lambada_openai,arc_easy \\\n    --batch_size 16\n```\n\n(or via `accelerate launch --no-python lm_eval`).\n\nFor cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.\n\n**WARNING**: This setup does not work with FSDP model sharding, so in `accelerate config` FSDP must be disabled, or the NO_SHARD FSDP option must be used.\n\nThe second way of using `accelerate` for multi-GPU evaluation is when your model is *too large to fit on a single GPU.*\n\nIn this setting, run the library *outside the `accelerate` launcher*, but passing `parallelize=True` to `--model_args` as follows:\n\n```bash\nlm_eval --model hf \\\n    --tasks lambada_openai,arc_easy \\\n    --model_args parallelize=True \\\n    --batch_size 16\n```\n\nThis means that your model's weights will be split across all available GPUs.\n\nFor more advanced users or even larger models, we allow for the following arguments when `parallelize=True` as well:\n\n- `device_map_option`: How to split model weights across available GPUs. defaults to \"auto\".\n- `max_memory_per_gpu`: the max GPU memory to use per GPU in loading the model.\n- `max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.\n- `offload_folder`: a folder where model weights will be offloaded to disk if needed.\n\nThe third option is to use both at the same time. This will allow you to take advantage of both data parallelism and model sharding, and is especially useful for models that are too large to fit on a single GPU.\n\n```bash\naccelerate launch --multi_gpu --num_processes {nb_of_copies_of_your_model} \\\n    -m lm_eval --model hf \\\n    --tasks lambada_openai,arc_easy \\\n    --model_args parallelize=True \\\n    --batch_size 16\n```\n\nTo learn more about model parallelism and how to use it with the `accelerate` library, see the [accelerate documentation](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fv4.15.0\u002Fen\u002Fparallelism)\n\n**Warning: We do not natively support multi-node evaluation using the `hf` model type! 
Please reference [our GPT-NeoX library integration](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Feval.py) for an example of code in which a custom multi-machine evaluation script is written.**\n\n**Note: we do not currently support multi-node evaluations natively, and advise using either an externally hosted server to run inference requests against, or creating a custom integration with your distributed framework [as is done for the GPT-NeoX library](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Feval_tasks\u002Feval_adapter.py).**\n\n### Steered Hugging Face `transformers` models\n\nTo evaluate a Hugging Face `transformers` model with steering vectors applied, specify the model type as `steered` and provide the path to either a PyTorch file containing pre-defined steering vectors, or a CSV file that specifies how to derive steering vectors from pretrained `sparsify` or `sae_lens` models (you will need to install the corresponding optional dependency for this method).\n\nSpecify pre-defined steering vectors:\n\n```python\nimport torch\n\nsteer_config = {\n    \"layers.3\": {\n        \"steering_vector\": torch.randn(1, 768),\n        \"bias\": torch.randn(1, 768),\n        \"steering_coefficient\": 1,\n        \"action\": \"add\"\n    },\n}\ntorch.save(steer_config, \"steer_config.pt\")\n```\n\nSpecify derived steering vectors:\n\n```python\nimport pandas as pd\n\npd.DataFrame({\n    \"loader\": [\"sparsify\"],\n    \"action\": [\"add\"],\n    \"sparse_model\": [\"EleutherAI\u002Fsae-pythia-70m-32k\"],\n    \"hookpoint\": [\"layers.3\"],\n    \"feature_index\": [30],\n    \"steering_coefficient\": [10.0],\n}).to_csv(\"steer_config.csv\", index=False)\n```\n\nRun the evaluation harness with steering vectors applied:\n\n```bash\nlm_eval --model steered \\\n    --model_args pretrained=EleutherAI\u002Fpythia-160m,steer_path=steer_config.pt \\\n    --tasks lambada_openai,hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\n### NVIDIA `nemo` models\n\n[NVIDIA NeMo Framework](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo) is a generative AI framework built for researchers and pytorch developers working on language models.\n\nTo evaluate a `nemo` model, start by installing NeMo following [the documentation](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo?tab=readme-ov-file#installation). We highly recommended to use the NVIDIA PyTorch or NeMo container, especially if having issues installing Apex or any other dependencies (see [latest released containers](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo\u002Freleases)). Please also install the lm evaluation harness library following the instructions in [the Install section](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Ftree\u002Fmain?tab=readme-ov-file#install).\n\nNeMo models can be obtained through [NVIDIA NGC Catalog](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Fmodels) or in [NVIDIA's Hugging Face page](https:\u002F\u002Fhuggingface.co\u002Fnvidia). 
In [NVIDIA NeMo Framework](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo\u002Ftree\u002Fmain\u002Fscripts\u002Fnlp_language_modeling) there are conversion scripts to convert the `hf` checkpoints of popular models like llama, falcon, mixtral or mpt to `nemo`.\n\nRun a `nemo` model on one GPU:\n\n```bash\nlm_eval --model nemo_lm \\\n    --model_args path=\u003Cpath_to_nemo_model> \\\n    --tasks hellaswag \\\n    --batch_size 32\n```\n\nIt is recommended to unpack the `nemo` model to avoid unpacking it inside the Docker container, as it may overflow disk space. To do so, you can run:\n\n```bash\nmkdir MY_MODEL\ntar -xvf MY_MODEL.nemo -C MY_MODEL\n```\n\n#### Multi-GPU evaluation with NVIDIA `nemo` models\n\nBy default, only one GPU is used. But we do support either data replication or tensor\u002Fpipeline parallelism during evaluation, on one node.\n\n1) To enable data replication, set `devices` in `model_args` to the number of data replicas to run. For example, the command to run 8 data replicas over 8 GPUs is:\n\n```bash\ntorchrun --nproc-per-node=8 --no-python lm_eval \\\n    --model nemo_lm \\\n    --model_args path=\u003Cpath_to_nemo_model>,devices=8 \\\n    --tasks hellaswag \\\n    --batch_size 32\n```\n\n2) To enable tensor and\u002For pipeline parallelism, set `tensor_model_parallel_size` and\u002For `pipeline_model_parallel_size` in `model_args`. In addition, you also have to set `devices` equal to the product of `tensor_model_parallel_size` and\u002For `pipeline_model_parallel_size`. For example, the command to use one node of 4 GPUs with tensor parallelism of 2 and pipeline parallelism of 2 is:\n\n```bash\ntorchrun --nproc-per-node=4 --no-python lm_eval \\\n    --model nemo_lm \\\n    --model_args path=\u003Cpath_to_nemo_model>,devices=4,tensor_model_parallel_size=2,pipeline_model_parallel_size=2 \\\n    --tasks hellaswag \\\n    --batch_size 32\n```\n\nNote that it is recommended to substitute the `python` command with `torchrun --nproc-per-node=\u003Cnumber of devices> --no-python` to facilitate loading the model into the GPUs. This is especially important for large checkpoints loaded into multiple GPUs.\n\nNot supported yet: multi-node evaluation and combinations of data replication with tensor or pipeline parallelism.\n\n### Megatron-LM models\n\n[Megatron-LM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) is NVIDIA's large-scale transformer training framework. 
This backend allows direct evaluation of Megatron-LM checkpoints without conversion.\n\n**Requirements:**\n- Megatron-LM must be installed or accessible via `MEGATRON_PATH` environment variable\n- PyTorch with CUDA support\n\n**Setup:**\n\n```bash\n# Set environment variable pointing to Megatron-LM installation\nexport MEGATRON_PATH=\u002Fpath\u002Fto\u002FMegatron-LM\n```\n\n**Basic usage (single GPU):**\n\n```bash\nlm_eval --model megatron_lm \\\n    --model_args load=\u002Fpath\u002Fto\u002Fcheckpoint,tokenizer_type=HuggingFaceTokenizer,tokenizer_model=\u002Fpath\u002Fto\u002Ftokenizer \\\n    --tasks hellaswag \\\n    --batch_size 1\n```\n\n**Supported checkpoint formats:**\n- Standard Megatron checkpoints (`model_optim_rng.pt`)\n- Distributed checkpoints (`.distcp` format, auto-detected)\n\n#### Parallelism Modes\n\nThe Megatron-LM backend supports the following parallelism modes:\n\n| Mode | Configuration | Description |\n|------|---------------|-------------|\n| Single GPU | `devices=1` (default) | Standard single GPU evaluation |\n| Data Parallelism | `devices>1, TP=1` | Each GPU has a full model replica, data is distributed |\n| Tensor Parallelism | `TP == devices` | Model layers are split across GPUs |\n| Expert Parallelism | `EP == devices, TP=1` | For MoE models, experts are distributed across GPUs |\n\n> [!Note]\n> - Pipeline Parallelism (PP > 1) is not currently supported.\n> - Expert Parallelism (EP) cannot be combined with Tensor Parallelism (TP).\n\n**Data Parallelism (4 GPUs, each with full model replica):**\n\n```bash\ntorchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \\\n    --model_args load=\u002Fpath\u002Fto\u002Fcheckpoint,tokenizer_model=\u002Fpath\u002Fto\u002Ftokenizer,devices=4 \\\n    --tasks hellaswag\n```\n\n**Tensor Parallelism (TP=2):**\n\n```bash\ntorchrun --nproc-per-node=2 -m lm_eval --model megatron_lm \\\n    --model_args load=\u002Fpath\u002Fto\u002Fcheckpoint,tokenizer_model=\u002Fpath\u002Fto\u002Ftokenizer,devices=2,tensor_model_parallel_size=2 \\\n    --tasks hellaswag\n```\n\n**Expert Parallelism for MoE models (EP=4):**\n\n```bash\ntorchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \\\n    --model_args load=\u002Fpath\u002Fto\u002Fmoe_checkpoint,tokenizer_model=\u002Fpath\u002Fto\u002Ftokenizer,devices=4,expert_model_parallel_size=4 \\\n    --tasks hellaswag\n```\n\n**Using extra_args for additional Megatron options:**\n\n```bash\nlm_eval --model megatron_lm \\\n    --model_args load=\u002Fpath\u002Fto\u002Fcheckpoint,tokenizer_model=\u002Fpath\u002Fto\u002Ftokenizer,extra_args=\"--no-rope-fusion --trust-remote-code\" \\\n    --tasks hellaswag\n```\n\n> [!Note]\n> The `--use-checkpoint-args` flag is enabled by default, which loads model architecture parameters from the checkpoint. For checkpoints converted via Megatron-Bridge, this typically includes all necessary model configuration.\n\n#### Multi-GPU evaluation with OpenVINO models\n\nPipeline parallelism during evaluation is supported with OpenVINO models\n\nTo enable pipeline parallelism, set the `model_args` of `pipeline_parallel`. 
In addition, you also have to set up `device` to value `HETERO:\u003CGPU index1>,\u003CGPU index2>` for example `HETERO:GPU.1,GPU.0` For example, the command to use pipeline parallelism of 2 is:\n\n```bash\nlm_eval --model openvino \\\n    --tasks wikitext \\\n    --model_args pretrained=\u003Cpath_to_ov_model>,pipeline_parallel=True \\\n    --device HETERO:GPU.1,GPU.0\n```\n\n### Tensor + Data Parallel and Optimized Inference with `vLLM`\n\nWe also support vLLM for faster inference on [supported model types](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fmodels\u002Fsupported_models.html), especially faster when splitting a model across multiple GPUs. For single-GPU or multi-GPU — tensor parallel, data parallel, or a combination of both — inference, for example:\n\n```bash\nlm_eval --model vllm \\\n    --model_args pretrained={model_name},tensor_parallel_size={GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size={model_replicas} \\\n    --tasks lambada_openai \\\n    --batch_size auto\n```\n\nTo use vllm, do `pip install \"lm_eval[vllm]\"`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002Fe74ec966556253fbe3d8ecba9de675c77c075bce\u002Flm_eval\u002Fmodels\u002Fvllm_causallms.py) and the vLLM documentation.\n\nvLLM occasionally differs in output from Huggingface. We treat Huggingface as the reference implementation and provide a [script](.\u002Fscripts\u002Fmodel_comparator.py) for checking the validity of vllm results against HF.\n\n> [!Tip]\n> For fastest performance, we recommend using `--batch_size auto` for vLLM whenever possible, to leverage its continuous batching functionality!\n\n> [!Tip]\n> Passing `max_model_len=4096` or some other reasonable default to vLLM through model args may cause speedups or prevent out-of-memory errors when trying to use auto batch size, such as for Mistral-7B-v0.1 which defaults to a maximum length of 32k.\n\n### Tensor + Data Parallel and Fast Offline Batching Inference with `SGLang`\n\nWe support SGLang for efficient offline batch inference. Its **[Fast Backend Runtime](https:\u002F\u002Fdocs.sglang.ai\u002Findex.html)** delivers high performance through optimized memory management and parallel processing techniques. Key features include tensor parallelism, continuous batching, and support for various quantization methods (FP8\u002FINT4\u002FAWQ\u002FGPTQ).\n\nTo use SGLang as the evaluation backend, please **install it in advance** via SGLang documents [here](https:\u002F\u002Fdocs.sglang.io\u002Fget_started\u002Finstall.html#install-sglang).\n\n> [!Tip]\n> Due to the installing method of [`Flashinfer`](https:\u002F\u002Fdocs.flashinfer.ai\u002F)-- a fast attention kernel library, we don't include the dependencies of `SGLang` within [pyproject.toml](pyproject.toml). Note that the `Flashinfer` also has some requirements on `torch` version.\n\nSGLang's server arguments are slightly different from other backends, see [here](https:\u002F\u002Fdocs.sglang.io\u002Fadvanced_features\u002Fserver_arguments.html) for more information. We provide an example of the usage here:\n\n```bash\nlm_eval --model sglang \\\n    --model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto \\\n    --tasks gsm8k_cot \\\n    --batch_size auto\n```\n\n> [!Tip]\n> When encountering out-of-memory (OOM) errors (especially for multiple-choice tasks), try these solutions:\n>\n> 1. 
Use a manual `batch_size`, rather than `auto`.\n> 2. Lower KV cache pool memory usage by adjusting `mem_fraction_static` - add it to your model arguments, for example `--model_args pretrained=...,mem_fraction_static=0.7`.\n> 3. Increase the tensor parallel size `tp_size` (if using multiple GPUs).\n\n### Windows ML\n\nWe support **Windows ML** for hardware-accelerated inference on Windows platforms. This enables evaluation on CPU, GPU, and **NPU (Neural Processing Unit)** devices.\n\nLearn more about Windows ML: https:\u002F\u002Flearn.microsoft.com\u002Fen-us\u002Fwindows\u002Fai\u002Fnew-windows-ml\u002Foverview\n\nTo use Windows ML, install the required dependencies:\n\n```bash\npip install wasdk-Microsoft.Windows.AI.MachineLearning[all] wasdk-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap onnxruntime-windowsml onnxruntime-genai-winml\n```\n\nEvaluate an ONNX Runtime GenAI LLM on NPU\u002FGPU\u002FCPU on Windows:\n\n```bash\nlm_eval --model winml \\\n    --model_args pretrained=\u002Fpath\u002Fto\u002Fonnx\u002Fmodel \\\n    --tasks mmlu \\\n    --batch_size 1\n```\n\n> [!Note]\n> The Windows ML backend is ONLY for the ONNX Runtime GenAI model format. Models targeting `transformers.js` won't work. You can verify the format by checking for a `genai_config.json` file in the model folder.\n\n> [!Note]\n> To run an ONNX Runtime GenAI model on the target device, you MUST convert the original model for that vendor and device type. Converted models won't work, or won't work well, on other vendors or device types. To learn more about model conversion, please visit [Microsoft AI Tool Kit](https:\u002F\u002Fcode.visualstudio.com\u002Fdocs\u002Fintelligentapps\u002Fmodelconversion).\n\n### Model APIs and Inference Servers\n\n> [!Important]\n> To use API-based models, first install: `pip install \"lm_eval[api]\"`\n\nOur library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local\u002Fself-hosted inference servers.\n\nTo call a hosted model, use:\n\n```bash\nexport OPENAI_API_KEY=YOUR_KEY_HERE\nlm_eval --model openai-completions \\\n    --model_args model=davinci-002 \\\n    --tasks lambada_openai,hellaswag\n```\n\nWe also support using your own local inference server, provided it mirrors the OpenAI Completions or ChatCompletions API:\n\n```bash\nlm_eval --model local-completions --tasks gsm8k --model_args model=facebook\u002Fopt-125m,base_url=http:\u002F\u002F{yourip}:8000\u002Fv1\u002Fcompletions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=16\n```\n\nNote that for externally hosted models, configs such as `--device`, which relate to where to place a local model, should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.\n\n| API or Inference Server                                                                                                   | Implemented?                                                                                            
| `--model \u003Cxxx>` name                                | Models supported:                                                                                                                                                                                                                                                                                                                                          | Request Types:                                                                 |\n|---------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------|\n| OpenAI Completions                                                                                                        | :heavy_check_mark:                                                                                      | `openai-completions`, `local-completions`           | All OpenAI Completions API models                                                                                                                                                                                                                                                                                                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| OpenAI ChatCompletions                                                                                                    | :heavy_check_mark:                                                                                      | `openai-chat-completions`, `local-chat-completions` | [All ChatCompletions API models](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fguides\u002Fgpt)                                                                                                                                                                                                                                                                              | `generate_until` (no logprobs)                                                 |\n| Anthropic                                                                                                                 | :heavy_check_mark:                                                                                      | `anthropic`                                         | [Supported Anthropic Engines](https:\u002F\u002Fdocs.anthropic.com\u002Fclaude\u002Freference\u002Fselecting-a-model)                                                                                                                                                                                                                                                               | `generate_until` (no logprobs)                                                 |\n| Anthropic Chat                                                                                                            | :heavy_check_mark: 
                                                                                     | `anthropic-chat`, `anthropic-chat-completions`      | [Supported Anthropic Engines](https:\u002F\u002Fdocs.anthropic.com\u002Fclaude\u002Fdocs\u002Fmodels-overview)                                                                                                                                                                                                                                                                      | `generate_until` (no logprobs)                                                 |\n| Textsynth                                                                                                                 | :heavy_check_mark:                                                                                      | `textsynth`                                         | [All supported engines](https:\u002F\u002Ftextsynth.com\u002Fdocumentation.html#engines)                                                                                                                                                                                                                                                                                  | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Cohere                                                                                                                    | [:hourglass: - blocked on Cohere API bug](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F395) | N\u002FA                                                 | [All `cohere.generate()` engines](https:\u002F\u002Fdocs.cohere.com\u002Fdocs\u002Fmodels)                                                                                                                                                                                                                                                                                     | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| [Llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) (via [llama-cpp-python](https:\u002F\u002Fgithub.com\u002Fabetlen\u002Fllama-cpp-python)) | :heavy_check_mark:                                                                                      | `gguf`, `ggml`                                      | [All models supported by llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp)                                                                                                                                                                                                                                                                                | `generate_until`, `loglikelihood`, (perplexity evaluation not yet implemented) |\n| vLLM                                                                                                                      | :heavy_check_mark:                                                                                      | `vllm`                                              | [Most HF Causal Language Models](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fmodels\u002Fsupported_models.html)                                                                                                                                                                                                                                                              | 
`generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Mamba                                                                                                                     | :heavy_check_mark:                                                                                      | `mamba_ssm`                                         | [Mamba architecture Language Models via the `mamba_ssm` package](https:\u002F\u002Fhuggingface.co\u002Fstate-spaces)                                                                                                                                                                                                                                                      | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Huggingface Optimum (Causal LMs)                                                                                          | :heavy_check_mark:                                                                                      | `openvino`                                          | Any decoder-only AutoModelForCausalLM converted with Huggingface Optimum into OpenVINO™ Intermediate Representation (IR) format                                                                                                                                                                                                                            | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Huggingface Optimum-intel IPEX (Causal LMs)                                                                               | :heavy_check_mark:                                                                                      | `ipex`                                              | Any decoder-only AutoModelForCausalLM                                                                                                                                                                                                                                                                                                                      | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Huggingface Optimum-habana (Causal LMs)                                                                               | :heavy_check_mark:                                                                                      | `habana`                                              | Any decoder-only AutoModelForCausalLM                                                                                                                                                                                                                                                                                                                      | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Neuron via AWS Inf2 (Causal LMs)                                                                                          | :heavy_check_mark:                                                                                      | `neuronx`                                           | Any decoder-only AutoModelForCausalLM supported to run on [huggingface-ami image for inferentia2](https:\u002F\u002Faws.amazon.com\u002Fmarketplace\u002Fpp\u002Fprodview-gr3e6yiscria2)                                                                                                                          
                                                                  | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| NVIDIA NeMo                                                                                                               | :heavy_check_mark:                                                                                      | `nemo_lm`                                           | [All supported models](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo-framework\u002Fuser-guide\u002F24.09\u002Fnemotoolkit\u002Fcore\u002Fcore.html#nemo-models)                                                                                                                                                                                                                                     | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| NVIDIA Megatron-LM                                                                                                        | :heavy_check_mark:                                                                                      | `megatron_lm`                                       | [Megatron-LM GPT models](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) (standard and distributed checkpoints)                                                                                                                                                                                                                                                     | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Watsonx.ai                                                                                                                | :heavy_check_mark:                                                                                      | `watsonx_llm`                                       | [Supported Watsonx.ai Engines](https:\u002F\u002Fdataplatform.cloud.ibm.com\u002Fdocs\u002Fcontent\u002Fwsj\u002Fanalyze-data\u002Ffm-models.html?context=wx)                                                                                                                                                                                                                                 | `generate_until` `loglikelihood`                                               |\n| Windows ML                                                                                           | :heavy_check_mark:                                                                                      | `winml`                                             | [ONNX models in GenAI format](https:\u002F\u002Fcode.visualstudio.com\u002Fdocs\u002Fintelligentapps\u002Fmodelconversion)                                                                                                                                                                                                                                                                                                                                 | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| [Your local inference server!](docs\u002FAPI_guide.md)                                                                         | :heavy_check_mark:                                                                                      | `local-completions` or `local-chat-completions`     | Support for OpenAI API-compatible servers, with easy customization for 
other APIs.                                                                                                                                                                                                                                                                         | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n\nModels which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs\u002Flogits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.\n\nFor more information on the different task `output_types` and model request types, see [our documentation](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002Fmain\u002Fdocs\u002Fmodel_guide.md#interface).\n\n> [!Note]\n> For best performance with closed chat model APIs such as Anthropic Claude 3 and GPT-4, we recommend carefully looking at a few sample outputs using `--limit 10` first to confirm answer extraction and scoring on generative tasks is performing as expected. Providing `system=\"\u003Csome system prompt here>\"` within `--model_args` for anthropic-chat-completions, to instruct the model what format to respond in, may be useful.\n\n### Other Frameworks\n\nA number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Feval_tasks\u002Feval_adapter.py), [Megatron-DeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMegatron-DeepSpeed\u002Fblob\u002Fmain\u002Fexamples\u002FMoE\u002Freadme_evalharness.md), and [mesh-transformer-jax](https:\u002F\u002Fgithub.com\u002Fkingoflolz\u002Fmesh-transformer-jax\u002Fblob\u002Fmaster\u002Feval_harness.py).\n\nTo create your own custom integration you can follow instructions from [this tutorial](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002Fmain\u002Fdocs\u002Finterface.md#external-library-usage).\n\n### Additional Features\n\n> [!Note]\n> For tasks unsuitable for direct evaluation — either due to risks associated with executing untrusted code or complexities in the evaluation process — the `--predict_only` flag is available to obtain decoded generations for post-hoc evaluation.\n\nIf you have a Metal-compatible Mac, you can run the eval harness using the MPS back-end by replacing `--device cuda:0` with `--device mps` (requires PyTorch version 2.1 or higher). **Note that the PyTorch MPS backend is still in early stages of development, so correctness issues or unsupported operations may exist. 
If you observe oddities in model performance on the MPS back-end, we recommend first checking that a forward pass of your model on `--device cpu` and `--device mps` match.**\n\n> [!Note]\n> You can inspect what the LM inputs look like by running the following command:\n>\n> ```bash\n> python write_out.py \\\n>     --tasks \u003Ctask1,task2,...> \\\n>     --num_fewshot 5 \\\n>     --num_examples 10 \\\n>     --output_base_path \u002Fpath\u002Fto\u002Foutput\u002Ffolder\n> ```\n>\n> This will write out one text file for each task.\n\nTo verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:\n\n```bash\nlm_eval --model openai \\\n    --model_args engine=davinci-002 \\\n    --tasks lambada_openai,hellaswag \\\n    --check_integrity\n```\n\n## Advanced Usage Tips\n\nFor models loaded with the HuggingFace  `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI\u002Fgpt-j-6b,parallelize=True,load_in_4bit=True,peft=nomic-ai\u002Fgpt4all-j-lora \\\n    --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \\\n    --device cuda:0\n```\n\nModels provided as delta weights can be easily loaded using the Hugging Face transformers library. Within --model_args, set the delta argument to specify the delta weights, and use the pretrained argument to designate the relative base model to which they will be applied:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=Ejafa\u002Fllama_7B,delta=lmsys\u002Fvicuna-7b-delta-v1.1 \\\n    --tasks hellaswag\n```\n\nGPTQ quantized models can be loaded using [GPTQModel](https:\u002F\u002Fgithub.com\u002FModelCloud\u002FGPTQModel) (faster) or [AutoGPTQ](https:\u002F\u002Fgithub.com\u002FPanQiWei\u002FAutoGPTQ)\n\nGPTQModel: add `,gptqmodel=True` to `model_args`\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=model-name-or-path,gptqmodel=True \\\n    --tasks hellaswag\n```\n\nAutoGPTQ: add `,autogptq=True` to `model_args`:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \\\n    --tasks hellaswag\n```\n\nWe support wildcards in task names, for example you can run all of the machine-translated lambada tasks via `--task lambada_openai_mt_*`.\n\n## Saving & Caching Results\n\nTo save evaluation results provide an `--output_path`. We also support logging model responses with the `--log_samples` flag for post-hoc analysis.\n\n> [!TIP]\n> Use `--use_cache \u003CDIR>` to cache evaluation results and skip previously evaluated samples when resuming runs of the same (model, task) pairs. Note that caching is rank-dependent, so restart with the same GPU count if interrupted. You can also use --cache_requests to save dataset preprocessing steps for faster evaluation resumption.\n\nTo push results and samples to the Hugging Face Hub, first ensure an access token with write access is set in the `HF_TOKEN` environment variable. 
Then, use the `--hf_hub_log_args` flag to specify the organization, repository name, repository visibility, and whether to push results and samples to the Hub - [example dataset on the  HF Hub](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FKonradSzafer\u002Flm-eval-results-demo). For instance:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \\\n    --tasks hellaswag \\\n    --log_samples \\\n    --output_path results \\\n    --hf_hub_log_args hub_results_org=EleutherAI,hub_repo_name=lm-eval-results,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False \\\n```\n\nThis allows you to easily download the results and samples from the Hub, using:\n\n```python\nfrom datasets import load_dataset\n\nload_dataset(\"EleutherAI\u002Flm-eval-results-private\", \"hellaswag\", \"latest\")\n```\n\nFor a full list of supported arguments, check out the [interface](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002Fmain\u002Fdocs\u002Finterface.md) guide in our documentation!\n\n## Visualizing Results\n\nYou can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights & Biases (W&B) and Zeno.\n\n### Zeno\n\nYou can use [Zeno](https:\u002F\u002Fzenoml.com) to visualize the results of your eval harness runs.\n\nFirst, head to [hub.zenoml.com](https:\u002F\u002Fhub.zenoml.com) to create an account and get an API key [on your account page](https:\u002F\u002Fhub.zenoml.com\u002Faccount).\nAdd this key as an environment variable:\n\n```bash\nexport ZENO_API_KEY=[your api key]\n```\n\nYou'll also need to install the `lm_eval[zeno]` package extra.\n\nTo visualize the results, run the eval harness with the `log_samples` and `output_path` flags.\nWe expect `output_path` to contain multiple folders that represent individual model names.\nYou can thus run your evaluation on any number of tasks and models and upload all of the results as projects on Zeno.\n\n```bash\nlm_eval \\\n    --model hf \\\n    --model_args pretrained=EleutherAI\u002Fgpt-j-6B \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8 \\\n    --log_samples \\\n    --output_path output\u002Fgpt-j-6B\n```\n\nThen, you can upload the resulting data using the `zeno_visualize` script:\n\n```bash\npython scripts\u002Fzeno_visualize.py \\\n    --data_path output \\\n    --project_name \"Eleuther Project\"\n```\n\nThis will use all subfolders in `data_path` as different models and upload all tasks within these model folders to Zeno.\nIf you run the eval harness on multiple tasks, the `project_name` will be used as a prefix and one project will be created per task.\n\nYou can find an example of this workflow in [examples\u002Fvisualize-zeno.ipynb](examples\u002Fvisualize-zeno.ipynb).\n\n### Weights and Biases\n\nWith the [Weights and Biases](https:\u002F\u002Fwandb.ai\u002Fsite) integration, you can now spend more time extracting deeper insights into your evaluation results. 
The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.\n\nThe integration provides the following functionality:\n\n- automatically log the evaluation results,\n- log the samples as W&B Tables for easy visualization,\n- log the `results.json` file as an artifact for version control,\n- log the `\u003Ctask_name>_eval_samples.json` file if the samples are logged,\n- generate a comprehensive report for analysis and visualization with all the important metrics,\n- log task- and CLI-specific configs,\n- and more out of the box, like the command used to run the evaluation, GPU\u002FCPU counts, timestamp, etc.\n\nFirst, install the `lm_eval[wandb]` package extra: `pip install lm_eval[wandb]`.\n\nAuthenticate your machine with your unique W&B token (visit https:\u002F\u002Fwandb.ai\u002Fauthorize to get one), then run `wandb login` in your command line terminal.\n\nRun the eval harness as usual with the `--wandb_args` flag. Use this flag to provide arguments for initializing a wandb run ([wandb.init](https:\u002F\u002Fdocs.wandb.ai\u002Fref\u002Fpython\u002Finit)) as comma-separated string arguments.\n\n```bash\nlm_eval \\\n    --model hf \\\n    --model_args pretrained=microsoft\u002Fphi-2,trust_remote_code=True \\\n    --tasks hellaswag,mmlu_abstract_algebra \\\n    --device cuda:0 \\\n    --batch_size 8 \\\n    --output_path output\u002Fphi-2 \\\n    --limit 10 \\\n    --wandb_args project=lm-eval-harness-integration \\\n    --log_samples\n```\n\nIn the stdout, you will find the link to the W&B run page as well as a link to the generated report. You can find an example of this workflow in [examples\u002Fvisualize-wandb.ipynb](examples\u002Fvisualize-wandb.ipynb), along with an example of how to integrate the logging beyond the CLI.\n\n## Contributing\n\nCheck out our [open issues](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fissues) and feel free to submit pull requests!\n\nFor more information on the library and how everything fits together, see our [documentation pages](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Ftree\u002Fmain\u002Fdocs).\n\nTo get started with development, first clone the repository and install the dev dependencies:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\ncd lm-evaluation-harness\npip install -e \".[dev,hf]\"\n```\n\n### Implementing new tasks\n\nTo implement a new task in the eval harness, see [this guide](.\u002Fdocs\u002Fnew_task_guide.md).\n\nIn general, we follow this priority list for addressing concerns about prompting and other eval details:\n\n1. If there is widespread agreement among people who train LLMs, use the agreed-upon procedure.\n2. If there is a clear and unambiguous official implementation, use that procedure.\n3. If there is widespread agreement among people who evaluate LLMs, use the agreed-upon procedure.\n4. If there are multiple common implementations but not universal or widespread agreement, use our preferred option among the common implementations. As before, prioritize choosing from among the implementations found in LLM training papers.\n\nThese are guidelines and not rules, and can be overruled in special circumstances.\n\nWe try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. 
Historically, we also prioritized the implementation from [Language Models are Few Shot Learners](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.14165) as our original goal was specifically to compare results with that paper.\n\n### Support\n\nThe best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https:\u002F\u002Fdiscord.gg\u002Feleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!\n\n## Optional Extras\n\nExtras dependencies can be installed via `pip install -e \".[NAME]\"`\n\n### Model Backends\n\nThese extras install dependencies required to run specific model backends:\n\n| NAME           | Description                                      |\n|----------------|--------------------------------------------------|\n| hf             | HuggingFace Transformers (torch, transformers, accelerate, peft) |\n| vllm           | vLLM fast inference                              |\n| api            | API models (OpenAI, Anthropic, local servers)    |\n| gptq           | AutoGPTQ quantized models                        |\n| gptqmodel      | GPTQModel quantized models                       |\n| ibm_watsonx_ai | IBM watsonx.ai models                            |\n| ipex           | Intel IPEX backend                               |\n| habana         | Intel Gaudi backend                              |\n| optimum        | Intel OpenVINO models                            |\n| neuronx        | AWS Inferentia2 instances                        |\n| winml          | Windows ML (ONNX Runtime GenAI) - CPU\u002FGPU\u002FNPU    |\n| sparsify       | Sparsify model steering                          |\n| sae_lens       | SAELens model steering                           |\n\n### Task Dependencies\n\nThese extras install dependencies required for specific evaluation tasks:\n\n| NAME                 | Description                    |\n|----------------------|--------------------------------|\n| tasks                | All task-specific dependencies |\n| acpbench             | ACP Bench tasks                |\n| audiolm_qwen         | Qwen2 audio models             |\n| ifeval               | IFEval task                    |\n| japanese_leaderboard | Japanese LLM tasks             |\n| longbench            | LongBench tasks                |\n| math                 | Math answer checking           |\n| multilingual         | Multilingual tokenizers        |\n| ruler                | RULER tasks                    |\n\n### Development & Utilities\n\n| NAME          | Description                    |\n|---------------|--------------------------------|\n| dev           | Linting & contributions        |\n| hf_transfer   | Speed up HF downloads          |\n| sentencepiece | Sentencepiece tokenizer        |\n| unitxt        | Unitxt tasks                   |\n| wandb         | Weights & Biases logging       |\n| zeno          | Zeno result visualization      |\n\n## Cite as\n\n```text\n@misc{eval-harness,\n  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, 
Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},\n  title        = {The Language Model Evaluation Harness},\n  month        = 07,\n  year         = 2024,\n  publisher    = {Zenodo},\n  version      = {v0.4.3},\n  doi          = {10.5281\u002Fzenodo.12608602},\n  url          = {https:\u002F\u002Fzenodo.org\u002Frecords\u002F12608602}\n}\n```\n","# 语言模型评估框架\n\n[![DOI](https:\u002F\u002Fzenodo.org\u002Fbadge\u002FDOI\u002F10.5281\u002Fzenodo.10256836.svg)](https:\u002F\u002Fdoi.org\u002F10.5281\u002Fzenodo.10256836)\n\n---\n\n## 最新消息 📣\n- [2025\u002F12] **CLI 重构**：新增子命令（`run`、`ls`、`validate`），并通过 `--config` 参数支持 YAML 配置文件。详情请参阅 [CLI 参考文档](.\u002Fdocs\u002Finterface.md) 和 [配置指南](.\u002Fdocs\u002Fconfig_files.md)。\n- [2025\u002F12] **安装更轻量**：基础包不再包含 `transformers` 和 `torch`。模型后端需单独安装，例如：`pip install lm_eval[hf]`、`lm_eval[vllm]` 等。\n- [2025\u002F07] 为 `hf`（token\u002Fstr）、`vllm` 和 `sglang`（str）添加了 `think_end_token` 参数，用于去除支持该功能的模型中的思维链推理痕迹。\n- [2025\u002F03] 新增对 Hugging Face 模型的引导支持！\n- [2025\u002F02] 新增对 SGLang 的支持！\n- [2024\u002F09] 我们正在开发允许用户使用 LM Evaluation Harness 处理文本与图像多模态输入、文本输出任务的功能，并已作为原型特性新增了 `hf-multimodal` 和 `vllm-vlm` 模型类型以及 `mmmu` 任务。欢迎用户试用这一正在进行中的功能并进行压力测试；同时建议大家查看 [`lmms-eval`](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval)，这是一个从 lm-evaluation-harness 分支出来的优秀项目，提供了更广泛的多模态任务、模型和功能。\n- [2024\u002F07] 更新并重构了 [API 模型](docs\u002FAPI_guide.md) 支持，引入了批量和异步请求的支持，使自定义和按个人需求使用变得更加简便。**若要运行 Llama 405B，我们建议使用 VLLM 的 OpenAI 兼容 API 托管模型，并使用 `local-completions` 模型类型进行评估。**\n- [2024\u002F07] 新增了开放 LLM 排行榜任务！您可以在 [leaderboard](lm_eval\u002Ftasks\u002Fleaderboard\u002FREADME.md) 任务组中找到它们。\n\n---\n\n## 公告\n\n**lm-evaluation-harness 新版本 v0.4.0 已发布**！\n\n本次更新与新增功能包括：\n\n- **新增开放 LLM 排行榜任务！您可在 [leaderboard](lm_eval\u002Ftasks\u002Fleaderboard\u002FREADME.md) 任务组中找到它们。**\n- 内部重构\n- 基于配置的任务创建与配置\n- 更易导入和共享外部定义的任务配置 YAML 文件\n- 支持 Jinja2 提示词设计，可轻松修改提示词，并从 Promptsource 导入提示词\n- 更高级的配置选项，包括输出后处理、答案提取、每篇文档生成多个语言模型结果、可配置的少样本设置等\n- 性能提升及新增支持的建模库，包括更快的数据并行化 Hugging Face 模型使用、vLLM 支持、Hugging Face 的 MPS 支持等\n- 日志记录与可用性改进\n- 新增任务，如 CoT BIG-Bench-Hard、Belebele、用户自定义任务分组等\n\n更多详细信息请参阅 `docs\u002F` 目录下的更新文档。\n\n开发将继续在 `main` 分支上进行，我们鼓励您通过 GitHub 上的 Issues 或 PR，或在 [EleutherAI Discord](https:\u002F\u002Fdiscord.gg\u002Feleutherai) 中，向我们反馈所需功能、改进建议或提出疑问！\n\n---\n\n## 概述\n\n本项目提供了一个统一的框架，用于在大量不同的评估任务上测试生成式语言模型。\n\n**特点：**\n\n- 超过 60 个面向 LLM 的标准学术基准测试，包含数百个子任务和变体。\n- 支持通过 [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002F) 加载的模型（包括通过 [GPTQModel](https:\u002F\u002Fgithub.com\u002FModelCloud\u002FGPTQModel) 和 [AutoGPTQ](https:\u002F\u002Fgithub.com\u002FPanQiWei\u002FAutoGPTQ) 进行量化）、[GPT-NeoX](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox) 以及 [Megatron-DeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMegatron-DeepSpeed\u002F) 加载的模型，并提供灵活的、与分词器无关的接口。\n- 支持使用 [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) 进行快速且内存高效的推理。\n- 支持商业 API，包括 [OpenAI](https:\u002F\u002Fopenai.com) 和 [TextSynth](https:\u002F\u002Ftextsynth.com\u002F)。\n- 支持对 [Hugging Face 的 PEFT 库](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft) 中支持的适配器（如 LoRA）进行评估。\n- 支持本地模型和基准测试。\n- 使用公开可用的提示词进行评估，确保论文之间的可重复性和可比性。\n- 易于支持自定义提示词和评估指标。\n\nLanguage Model Evaluation Harness 是 🤗 Hugging Face 广受欢迎的 [开放 LLM 排行榜](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FHuggingFaceH4\u002Fopen_llm_leaderboard) 的后端，已被数百篇论文引用（见 Google 学术搜索：[链接](https:\u002F\u002Fscholar.google.com\u002Fscholar?oi=bibs&hl=en&authuser=2&cites=...)），并被 
NVIDIA、Cohere、BigScience、BigCode、Nous Research 和 Mosaic ML 等数十家机构内部使用。\n\n## 安装\n\n从 GitHub 仓库安装 `lm-eval` 包，请执行以下命令：\n\n```bash\ngit clone --depth 1 https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\ncd lm-evaluation-harness\npip install -e .\n```\n\n### 安装模型后端\n\n基础安装仅提供核心评估框架。**模型后端需单独安装**，可通过可选依赖项实现：\n\n对于 Hugging Face Transformers 模型：\n\n```bash\npip install \"lm_eval[hf]\"\n```\n\n对于 vLLM 推理：\n\n```bash\npip install \"lm_eval[vllm]\"\n```\n\n对于基于 API 的模型（OpenAI、Anthropic 等）：\n\n```bash\npip install \"lm_eval[api]\"\n```\n\n也可同时安装多个后端：\n\n```bash\npip install \"lm_eval[hf,vllm,api]\"\n```\n\n本文档末尾提供了所有可选依赖项的详细表格。\n\n## 基本用法\n\n### 文档\n\n| 指南 | 描述 |\n|-------|-------------|\n| [CLI 参考文档](.\u002Fdocs\u002Finterface.md) | 命令行参数和子命令 |\n| [配置指南](.\u002Fdocs\u002Fconfig_files.md) | YAML 配置文件格式及示例 |\n| [Python API 指南](.\u002Fdocs\u002Fpython-api.md) | 使用 `simple_evaluate()` 进行程序化调用 |\n| [任务指南](.\u002Flm_eval\u002Ftasks\u002FREADME.md) | 可用任务及任务配置 |\n\n使用 `lm-eval -h` 查看可用选项，或使用 `lm-eval run -h` 查看评估选项。\n\n列出可用任务：\n\n```bash\nlm-eval ls tasks\n```\n\n### Hugging Face `transformers`\n\n> [!重要]\n> 要使用 HuggingFace 后端，首先安装：`pip install \"lm_eval[hf]\"`\n\n要在 `hellaswag` 数据集上评估托管在 [HuggingFace Hub](https:\u002F\u002Fhuggingface.co\u002Fmodels) 上的模型（例如 GPT-J-6B），可以使用以下命令（假设您正在使用兼容 CUDA 的 GPU）：\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI\u002Fgpt-j-6B \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\n可以使用 `--model_args` 标志向模型构造函数传递额外参数。最值得注意的是，这支持在 Hub 上使用 `revisions` 功能来存储部分训练好的检查点，或者指定运行模型的数据类型：\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI\u002Fpythia-160m,revision=step100000,dtype=\"float\" \\\n    --tasks lambada_openai,hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\nHuggingFace 支持通过 `transformers.AutoModelForCausalLM`（自回归、仅解码器的 GPT 风格模型）和 `transformers.AutoModelForSeq2SeqLM`（如 T5 等编码器-解码器模型）加载的模型。\n\n可以通过将 `--batch_size` 标志设置为 `auto` 来自动选择批次大小。这将自动检测适合您设备的最大批次大小。在最长和最短样本之间差异较大的任务中，定期重新计算最大批次大小有助于进一步提高速度。为此，可以在上述标志后附加 `:N`，以自动重新计算最大批次大小 `N` 次。例如，要重新计算批次大小 4 次，命令如下：\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI\u002Fpythia-160m,revision=step100000,dtype=\"float\" \\\n    --tasks lambada_openai,hellaswag \\\n    --device cuda:0 \\\n    --batch_size auto:4\n```\n\n> [!注释]\n> 就像您可以为 `transformers.AutoModel` 提供本地路径一样，也可以通过 `--model_args pretrained=\u002Fpath\u002Fto\u002Fmodel` 为 `lm_eval` 提供本地路径。\n\n#### 评估 GGUF 模型\n\n`lm-eval` 支持使用 Hugging Face (`hf`) 后端评估 GGUF 格式的模型。这使您可以使用与 `transformers`、`AutoModel` 和 llama.cpp 转换兼容的量化模型。\n\n要评估 GGUF 模型，请使用 `--model_args` 标志传递包含模型权重的目录路径、`gguf_file`，以及可选的单独 `tokenizer` 路径。\n\n**🚨 重要提示：**  \n如果未提供单独的分词器，HuggingFace 将尝试从 GGUF 文件中重建分词器——这可能需要 **数小时**，甚至无限期挂起。提供单独的分词器可以避免此问题，并将分词器加载时间从数小时缩短至几秒钟。\n\n**✅ 推荐用法：**\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=\u002Fpath\u002Fto\u002Fgguf_folder,gguf_file=model-name.gguf,tokenizer=\u002Fpath\u002Fto\u002Ftokenizer \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\n> [!提示]\n> 确保分词器路径指向有效的 HuggingFace 分词器目录（例如包含 tokenizer_config.json、vocab.json 等文件）。\n\n#### 使用 Hugging Face `accelerate` 进行多 GPU 评估\n\n我们支持三种主要方式来使用 HuggingFace 的 [accelerate 🚀](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) 库进行多 GPU 评估。\n\n要执行 *数据并行评估*（每个 GPU 加载模型的 **独立完整副本**），我们可以利用 `accelerate` 启动器，如下所示：\n\n```bash\naccelerate launch -m lm_eval --model hf \\\n    --tasks lambada_openai,arc_easy \\\n    --batch_size 16\n```\n\n（或通过 
`accelerate launch --no-python lm_eval`）。\n\n对于模型可以容纳在单个 GPU 上的情况，这种方式可以让您在 K 个 GPU 上以比单个 GPU 快 K 倍的速度进行评估。\n\n**警告**：此设置不适用于 FSDP 模型分片，因此在 `accelerate config` 中必须禁用 FSDP，或使用 NO_SHARD FSDP 选项。\n\n使用 `accelerate` 进行多 GPU 评估的第二种方式是当您的模型 *太大而无法容纳在单个 GPU 上时*。\n\n在这种情况下，应在 `accelerate` 启动器之外运行该库，但需将 `parallelize=True` 传递给 `--model_args`，如下所示：\n\n```bash\nlm_eval --model hf \\\n    --tasks lambada_openai,arc_easy \\\n    --model_args parallelize=True \\\n    --batch_size 16\n```\n\n这意味着您的模型权重将被拆分到所有可用的 GPU 上。\n\n对于更高级的用户或更大的模型，当 `parallelize=True` 时，还可以使用以下参数：\n\n- `device_map_option`：如何在可用 GPU 之间拆分模型权重，默认为 “auto”。\n- `max_memory_per_gpu`：每块 GPU 在加载模型时使用的最大显存。\n- `max_cpu_memory`：将模型权重卸载到 RAM 时使用的最大 CPU 内存量。\n- `offload_folder`：如果需要，模型权重将被卸载到磁盘的文件夹。\n\n第三种选择是同时使用前两种方法。这将使您能够同时利用数据并行性和模型分片，尤其适用于太大而无法容纳在单个 GPU 上的模型。\n\n```bash\naccelerate launch --multi_gpu --num_processes {nb_of_copies_of_your_model} \\\n    -m lm_eval --model hf \\\n    --tasks lambada_openai,arc_easy \\\n    --model_args parallelize=True \\\n    --batch_size 16\n```\n\n要了解有关模型并行性及其如何与 `accelerate` 库一起使用的更多信息，请参阅 [accelerate 文档](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fv4.15.0\u002Fen\u002Fparallelism)。\n\n**警告：我们目前不原生支持使用 `hf` 模型类型的多节点评估！请参考 [我们的 GPT-NeoX 库集成](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Feval.py)，其中包含一个自定义多机评估脚本的示例。**\n\n**注意：我们目前不原生支持多节点评估，建议使用外部托管的服务器来处理推理请求，或根据您的分布式框架创建自定义集成 [如同 GPT-NeoX 库所做的那样](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Feval_tasks\u002Feval_adapter.py)。**\n\n### 带有引导向量的 Hugging Face `transformers` 模型\n\n要使用引导向量评估 Hugging Face `transformers` 模型，需将模型类型指定为 `steered`，并提供包含预定义引导向量的 PyTorch 文件路径，或指定如何从预训练的 `sparsify` 或 `sae_lens` 模型中提取引导向量的 CSV 文件路径（此方法需要安装相应的可选依赖项）。\n\n指定预定义的引导向量：\n\n```python\nimport torch\n\nsteer_config = {\n    \"layers.3\": {\n        \"steering_vector\": torch.randn(1, 768),\n        \"bias\": torch.randn(1, 768),\n        \"steering_coefficient\": 1,\n        \"action\": \"add\"\n    },\n}\ntorch.save(steer_config, \"steer_config.pt\")\n```\n\n指定派生的引导向量：\n\n```python\nimport pandas as pd\n\npd.DataFrame({\n    \"loader\": [\"sparsify\"],\n    \"action\": [\"add\"],\n    \"sparse_model\": [\"EleutherAI\u002Fsae-pythia-70m-32k\"],\n    \"hookpoint\": [\"layers.3\"],\n    \"feature_index\": [30],\n    \"steering_coefficient\": [10.0],\n}).to_csv(\"steer_config.csv\", index=False)\n```\n\n运行应用引导向量的评估工具：\n\n```bash\nlm_eval --model steered \\\n    --model_args pretrained=EleutherAI\u002Fpythia-160m,steer_path=steer_config.pt \\\n    --tasks lambada_openai,hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\n### NVIDIA `nemo` 模型\n\n[NVIDIA NeMo Framework](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo) 是一个生成式 AI 框架，专为从事语言模型研究的科研人员和 PyTorch 开发者设计。\n\n要评估 `nemo` 模型，首先按照[文档](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo?tab=readme-ov-file#installation)安装 NeMo。我们强烈建议使用 NVIDIA 的 PyTorch 或 NeMo 容器，尤其是在安装 Apex 或其他依赖项时遇到问题的情况下（请参阅[最新发布的容器](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo\u002Freleases)）。此外，请按照[安装部分](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Ftree\u002Fmain?tab=readme-ov-file#install)中的说明安装 lm 评估工具库。\n\nNemo 模型可以通过 [NVIDIA NGC Catalog](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Fmodels) 或 [NVIDIA 的 Hugging Face 页面](https:\u002F\u002Fhuggingface.co\u002Fnvidia)获取。在 [NVIDIA NeMo Framework](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo\u002Ftree\u002Fmain\u002Fscripts\u002Fnlp_language_modeling) 
中，提供了用于将流行模型如 llama、falcon、mixtral 或 mpt 的 `hf` 检查点转换为 `nemo` 格式的脚本。\n\n在单个 GPU 上运行 `nemo` 模型：\n\n```bash\nlm_eval --model nemo_lm \\\n    --model_args path=\u003Cnemo_model_path> \\\n    --tasks hellaswag \\\n    --batch_size 32\n```\n\n建议先解压 `nemo` 模型，以避免在 Docker 容器内进行解压操作导致磁盘空间不足。可以执行以下命令：\n\n```bash\nmkdir MY_MODEL\ntar -xvf MY_MODEL.nemo -C MY_MODEL\n```\n\n#### 使用 NVIDIA `nemo` 模型进行多 GPU 评估\n\n默认情况下，仅使用一个 GPU。但我们支持在单节点上进行数据复制或张量\u002F流水线并行计算的评估。\n\n1) 要启用数据复制，需将 `model_args` 中的 `devices` 设置为要运行的数据副本数量。例如，在 8 个 GPU 上运行 8 个数据副本的命令如下：\n\n```bash\ntorchrun --nproc-per-node=8 --no-python lm_eval \\\n    --model nemo_lm \\\n    --model_args path=\u003Cnemo_model_path>,devices=8 \\\n    --tasks hellaswag \\\n    --batch_size 32\n```\n\n2) 要启用张量和\u002F或流水线并行计算，需设置 `model_args` 中的 `tensor_model_parallel_size` 和\u002F或 `pipeline_model_parallel_size`。此外，还需将 `devices` 设置为 `tensor_model_parallel_size` 和\u002F或 `pipeline_model_parallel_size` 的乘积。例如，在 4 个 GPU 的节点上使用张量并行度 2 和流水线并行度 2 的命令如下：\n\n```bash\ntorchrun --nproc-per-node=4 --no-python lm_eval \\\n    --model nemo_lm \\\n    --model_args path=\u003Cnemo_model_path>,devices=4,tensor_model_parallel_size=2,pipeline_model_parallel_size=2 \\\n    --tasks hellaswag \\\n    --batch_size 32\n```\n\n请注意，建议用 `torchrun --nproc-per-node=\u003C设备数量> --no-python` 替代 `python` 命令，以便更高效地将模型加载到各个 GPU 中。这对于加载到多个 GPU 上的大规模检查点尤为重要。\n\n目前尚不支持：多节点评估以及数据复制与张量或流水线并行计算的组合。\n\n### Megatron-LM 模型\n\n[Megatron-LM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) 是 NVIDIA 的大规模 Transformer 训练框架。该后端允许直接评估 Megatron-LM 检查点，无需转换。\n\n**要求：**\n- 必须安装 Megatron-LM，或通过 `MEGATRON_PATH` 环境变量访问\n- 支持 CUDA 的 PyTorch\n\n**设置：**\n\n```bash\n# 设置指向 Megatron-LM 安装路径的环境变量\nexport MEGATRON_PATH=\u002Fpath\u002Fto\u002FMegatron-LM\n```\n\n**基本用法（单 GPU）：**\n\n```bash\nlm_eval --model megatron_lm \\\n    --model_args load=\u002Fpath\u002Fto\u002Fcheckpoint,tokenizer_type=HuggingFaceTokenizer,tokenizer_model=\u002Fpath\u002Fto\u002Ftokenizer \\\n    --tasks hellaswag \\\n    --batch_size 1\n```\n\n**支持的检查点格式：**\n- 标准 Megatron 检查点 (`model_optim_rng.pt`)\n- 分布式检查点（`.distcp` 格式，自动检测）\n\n#### 并行模式\n\nMegatron-LM 后端支持以下并行模式：\n\n| 模式 | 配置 | 描述 |\n|------|-------|------|\n| 单 GPU | `devices=1`（默认） | 标准单 GPU 评估 |\n| 数据并行 | `devices>1, TP=1` | 每个 GPU 拥有完整的模型副本，数据被分发 |\n| 张量并行 | `TP == devices` | 模型层在 GPU 之间拆分 |\n| 专家并行 | `EP == devices, TP=1` | 对于 MoE 模型，专家分布在各个 GPU 上 |\n\n> [!注]\n> - 流水线并行（PP > 1）目前不支持。\n> - 专家并行（EP）不能与张量并行（TP）结合使用。\n\n**数据并行（4 个 GPU，每个 GPU 拥有完整模型副本）：**\n\n```bash\ntorchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \\\n    --model_args load=\u002Fpath\u002Fto\u002Fcheckpoint,tokenizer_model=\u002Fpath\u002Fto\u002Ftokenizer,devices=4 \\\n    --tasks hellaswag\n```\n\n**张量并行（TP=2）：**\n\n```bash\ntorchrun --nproc-per-node=2 -m lm_eval --model megatron_lm \\\n    --model_args load=\u002Fpath\u002Fto\u002Fcheckpoint,tokenizer_model=\u002Fpath\u002Fto\u002Ftokenizer,devices=2,tensor_model_parallel_size=2 \\\n    --tasks hellaswag\n```\n\n**MoE 模型的专家并行（EP=4）：**\n\n```bash\ntorchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \\\n    --model_args load=\u002Fpath\u002Fto\u002Fmoe_checkpoint,tokenizer_model=\u002Fpath\u002Fto\u002Ftokenizer,devices=4,expert_model_parallel_size=4 \\\n    --tasks hellaswag\n```\n\n**使用 extra_args 添加额外的 Megatron 选项：**\n\n```bash\nlm_eval --model megatron_lm \\\n    --model_args load=\u002Fpath\u002Fto\u002Fcheckpoint,tokenizer_model=\u002Fpath\u002Fto\u002Ftokenizer,extra_args=\"--no-rope-fusion --trust-remote-code\" 
\\\n    --tasks hellaswag\n```\n\n> [!注]\n> 默认启用 `--use-checkpoint-args` 标志，该标志会从检查点中加载模型架构参数。对于通过 Megatron-Bridge 转换的检查点，这通常包括所有必要的模型配置。\n\n#### 使用 OpenVINO 模型的多 GPU 评估\n\nOpenVINO 模型支持在评估过程中进行流水线并行。\n\n要启用流水线并行，需设置 `model_args` 中的 `pipeline_parallel` 参数。此外，还需将 `device` 设置为 `HETERO:\u003CGPU 索引1>,\u003CGPU 索引2>` 的形式，例如 `HETERO:GPU.1,GPU.0`。例如，使用 2 级流水线并行的命令如下：\n\n```bash\nlm_eval --model openvino \\\n    --tasks wikitext \\\n    --model_args pretrained=\u003Cpath_to_ov_model>,pipeline_parallel=True \\\n    --device HETERO:GPU.1,GPU.0\n```\n\n### 使用 vLLM 进行张量 + 数据并行及优化推理\n\n我们还支持 vLLM，以加快对[受支持的模型类型](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fmodels\u002Fsupported_models.html)的推理速度，尤其是在将模型拆分到多个 GPU 上时更为显著。无论是单 GPU 还是多 GPU——张量并行、数据并行，或两者的组合——都可以进行推理，例如：\n\n```bash\nlm_eval --model vllm \\\n    --model_args pretrained={model_name},tensor_parallel_size={GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size={model_replicas} \\\n    --tasks lambada_openai \\\n    --batch_size auto\n```\n\n要使用 vLLM，请运行 `pip install \"lm_eval[vllm]\"`。有关 vLLM 支持的完整配置列表，请参考我们的 [vLLM 集成文档](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002Fe74ec966556253fbe3d8ecba9de675c77c075bce\u002Flm_eval\u002Fmodels\u002Fvllm_causallms.py)以及 vLLM 的官方文档。\n\nvLLM 的输出有时会与 Huggingface 不同。我们将 Huggingface 视为参考实现，并提供了一个[脚本](.\u002Fscripts\u002Fmodel_comparator.py)，用于对比 vLLM 结果与 Huggingface 的一致性。\n\n> [!提示]\n> 为了获得最佳性能，我们建议尽可能对 vLLM 使用 `--batch_size auto`，以充分利用其连续批处理功能！\n\n> [!提示]\n> 通过模型参数向 vLLM 传递 `max_model_len=4096` 或其他合理的默认值，可能会提升速度或避免在尝试使用自动批大小时出现内存不足的问题，例如对于 Mistral-7B-v0.1，默认最大长度为 32k。\n\n### 使用 SGLang 进行张量 + 数据并行及快速离线批处理推理\n\n我们支持 SGLang，用于高效的离线批处理推理。其 **[Fast Backend Runtime](https:\u002F\u002Fdocs.sglang.ai\u002Findex.html)** 通过优化的内存管理和并行处理技术，提供了卓越的性能。关键特性包括张量并行、连续批处理以及对多种量化方法（FP8\u002FINT4\u002FAWQ\u002FGPTQ）的支持。\n\n要将 SGLang 作为评估后端，请务必**提前安装**，具体步骤请参阅 SGLang 文档[这里](https:\u002F\u002Fdocs.sglang.io\u002Fget_started\u002Finstall.html#install-sglang)。\n\n> [!提示]\n> 由于 [`Flashinfer`](https:\u002F\u002Fdocs.flashinfer.ai\u002F)——一个快速注意力核库——的安装方式，我们在 [pyproject.toml](pyproject.toml) 中并未包含 SGLang 的依赖项。请注意，`Flashinfer` 对 `torch` 版本也有一些要求。\n\nSGLang 的服务器参数与其他后端略有不同，更多信息请参阅[这里](https:\u002F\u002Fdocs.sglang.io\u002Fadvanced_features\u002Fserver_arguments.html)。以下是使用示例：\n\n```bash\nlm_eval --model sglang \\\n    --model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto \\\n    --tasks gsm8k_cot \\\n    --batch_size auto\n```\n\n> [!提示]\n> 当遇到内存不足（OOM）错误时（尤其是选择题任务），可以尝试以下解决方案：\n>\n> 1. 使用手动 `batch_size`，而不是 `auto`。\n> 2. 通过调整 `mem_fraction_static` 来降低 KV 缓存池的内存使用量——例如，在模型参数中添加 `--model_args pretrained=...,mem_fraction_static=0.7`。\n> 3. 
增加张量并行规模 `tp_size`（如果使用多个 GPU）。\n\n### Windows ML\n\n我们支持在 Windows 平台上使用 **Windows ML** 进行硬件加速推理。这使得模型可以在 CPU、GPU 以及 **NPU（神经处理单元）** 设备上进行评估。\n\n什么是 Windows ML？\nhttps:\u002F\u002Flearn.microsoft.com\u002Fen-us\u002Fwindows\u002Fai\u002Fnew-windows-ml\u002Foverview\n\n要使用 Windows ML，请安装所需的依赖项：\n\n```bash\npip install wasdk-Microsoft.Windows.AI.MachineLearning[all] wasdk-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap onnxruntime-windowsml onnxruntime-genai-winml\n```\n\n在 Windows 上使用 NPU\u002FGPU\u002FCPU 评估 ONNX Runtime GenAI LLM：\n\n```bash\nlm_eval --model winml \\\n    --model_args pretrained=\u002Fpath\u002Fto\u002Fonnx\u002Fmodel \\\n    --tasks mmlu \\\n    --batch_size 1\n```\n\n> [!注]\n> Windows ML 后端仅适用于 ONNX Runtime GenAI 模型格式。针对 `transformers.js` 的模型将无法运行。您可以通过在模型文件夹中查找 `genai_config.json` 文件来验证这一点。\n\n> [!注]\n> 要在目标设备上运行 ONNX Runtime GenAI 模型，您必须将原始模型转换为该厂商和设备类型的格式。转换后的模型在其他厂商或设备类型上可能无法正常工作或效果不佳。如需了解更多关于模型转换的信息，请访问 [Microsoft AI 工具包](https:\u002F\u002Fcode.visualstudio.com\u002Fdocs\u002Fintelligentapps\u002Fmodelconversion)。\n\n### 模型 API 和推理服务器\n\n> [!重要]\n> 若要使用基于 API 的模型，首先需要安装：`pip install \"lm_eval[api]\"`\n\n我们的库还支持通过多个商业 API 提供的服务模型的评估，并计划实现对最常用的高性能本地\u002F自托管推理服务器的支持。\n\n要调用托管模型，可以使用以下命令：\n\n```bash\nexport OPENAI_API_KEY=YOUR_KEY_HERE\nlm_eval --model openai-completions \\\n    --model_args model=davinci-002 \\\n    --tasks lambada_openai,hellaswag\n```\n\n我们还支持使用您自己的本地推理服务器，这些服务器能够镜像 OpenAI Completions 和 ChatCompletions API。\n\n```bash\nlm_eval --model local-completions --tasks gsm8k --model_args model=facebook\u002Fopt-125m,base_url=http:\u002F\u002F{yourip}:8000\u002Fv1\u002Fcompletions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=16\n```\n\n请注意，对于外部托管的模型，与本地模型位置相关的配置参数（如 `--device`）不应使用，且不会生效。就像您可以使用 `--model_args` 向本地模型的构造函数传递任意参数一样，也可以使用它向托管模型的 API 传递任意参数。有关他们支持哪些参数的信息，请参阅托管服务的文档。\n\n| API 或推理服务器                                                                                                   | 已实现？                                                                                            | `--model \u003Cxxx>` 名称                                | 支持的模型：                                                                                                                                                                                                                                                                                                                                          | 请求类型：                                                                 |\n|---------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------|\n| OpenAI Completions                                                                                                        | √                                                                                      | `openai-completions`, `local-completions`           | 所有 
OpenAI Completions API 模型                                                                                                                                                                                                                                                                                                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| OpenAI ChatCompletions                                                                                                    | √                                                                                      | `openai-chat-completions`, `local-chat-completions` | [所有 ChatCompletions API 模型](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fguides\u002Fgpt)                                                                                                                                                                                                                                                                              | `generate_until`（无 logprobs）                                                 |\n| Anthropic                                                                                                                 | √                                                                                      | `anthropic`                                         | [支持的 Anthropic 引擎](https:\u002F\u002Fdocs.anthropic.com\u002Fclaude\u002Freference\u002Fselecting-a-model)                                                                                                                                                                                                                                                               | `generate_until`（无 logprobs）                                                 |\n| Anthropic Chat                                                                                                            | √                                                                                      | `anthropic-chat`, `anthropic-chat-completions`      | [支持的 Anthropic 引擎](https:\u002F\u002Fdocs.anthropic.com\u002Fclaude\u002Fdocs\u002Fmodels-overview)                                                                                                                                                                                                                                                                      | `generate_until`（无 logprobs）                                                 |\n| Textsynth                                                                                                                 | √                                                                                      | `textsynth`                                         | [所有支持的引擎](https:\u002F\u002Ftextsynth.com\u002Fdocumentation.html#engines)                                                                                                                                                                                                                                                                                  | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Cohere                                                                                                                    | [:hourglass: - 受 Cohere API 问题阻塞](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F395) | 无                                  
                | [所有 `cohere.generate()` 引擎](https:\u002F\u002Fdocs.cohere.com\u002Fdocs\u002Fmodels)                                                                                                                                                                                                                                                                                     | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| [Llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp)（通过 [llama-cpp-python](https:\u002F\u002Fgithub.com\u002Fabetlen\u002Fllama-cpp-python)） | √                                                                                      | `gguf`, `ggml`                                      | [Llama.cpp 支持的所有模型](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp)                                                                                                                                                                                                                                                                                | `generate_until`, `loglikelihood`,（困惑度评估尚未实现）                       |\n| vLLM                                                                                                                      | √                                                                                      | `vllm`                                              | [大多数 HF 因果语言模型](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fmodels\u002Fsupported_models.html)                                                                                                                                                                                                                                                              | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Mamba                                                                                                                     | √                                                                                      | `mamba_ssm`                                         | [通过 `mamba_ssm` 包支持的 Mamba 架构语言模型](https:\u002F\u002Fhuggingface.co\u002Fstate-spaces)                                                                                                                                                                                                                                                      | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Huggingface Optimum（因果 LM）                                                                                            | √                                                                                      | `openvino`                                          | 任何使用 Huggingface Optimum 转换为 OpenVINO™ 中间表示 (IR) 格式的解码器式 AutoModelForCausalLM                                                                                                                                                                                                                            | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Huggingface Optimum-intel IPEX（因果 LM）                                                                               | √                                                                                      | `ipex`                                              | 任何解码器式 AutoModelForCausalLM                           
                                                                                                                                                                                                                                                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Huggingface Optimum-habana（因果 LM）                                                                               | √                                                                                      | `habana`                                              | 任何解码器式 AutoModelForCausalLM                                                                                                                                                                                                                                                                                                                      | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Neuron 通过 AWS Inf2（因果 LM）                                                                                          | √                                                                                      | `neuronx`                                           | 任何支持在 [huggingface-ami image for inferentia2](https:\u002F\u002Faws.amazon.com\u002Fmarketplace\u002Fpp\u002Fprodview-gr3e6yiscria2) 上运行的解码器式 AutoModelForCausalLM                                                                                                                                                                                            | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| NVIDIA NeMo                                                                                                               | √                                                                                      | `nemo_lm`                                           | [所有支持的模型](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo-framework\u002Fuser-guide\u002F24.09\u002Fnemotoolkit\u002Fcore\u002Fcore.html#nemo-models)                                                                                                                                                                                                                                     | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| NVIDIA Megatron-LM                                                                                                        | √                                                                                      | `megatron_lm`                                       | [Megatron-LM GPT 模型](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM)（标准和分布式检查点）                                                                                                                                                                                                                                                     | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Watsonx.ai                                                                                                                | √                                                                                      | `watsonx_llm`                                       | [支持的 Watsonx.ai 
引擎](https:\u002F\u002Fdataplatform.cloud.ibm.com\u002Fdocs\u002Fcontent\u002Fwsj\u002Fanalyze-data\u002Ffm-models.html?context=wx) | `generate_until`, `loglikelihood` |\n| Windows ML | √ | `winml` | [GenAI 格式的 ONNX 模型](https:\u002F\u002Fcode.visualstudio.com\u002Fdocs\u002Fintelligentapps\u002Fmodelconversion) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |\n| [您本地的推理服务器！](docs\u002FAPI_guide.md) | √ | `local-completions` 或 `local-chat-completions` | 支持与 OpenAI API 兼容的服务器，并可轻松自定义其他 API。 | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |\n\nModels which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs\u002Flogits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.\n\nFor more information on the different task `output_types` and model request types, see [our documentation](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002Fmain\u002Fdocs\u002Fmodel_guide.md#interface).\n\n> [!Note]\n> For best performance with closed chat model APIs such as Anthropic Claude 3 and GPT-4, we recommend carefully looking at a few sample outputs using `--limit 10` first to confirm that answer extraction and scoring on generative tasks is performing as expected. Providing `system=\"\u003Csome system prompt here>\"` within `--model_args` for `anthropic-chat-completions`, to instruct the model what format to respond in, may be useful.\n\n### Other Frameworks\n\nA number of other libraries contain scripts for calling the eval harness through their library. 
These include [GPT-NeoX](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Feval_tasks\u002Feval_adapter.py), [Megatron-DeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMegatron-DeepSpeed\u002Fblob\u002Fmain\u002Fexamples\u002FMoE\u002Freadme_evalharness.md), and [mesh-transformer-jax](https:\u002F\u002Fgithub.com\u002Fkingoflolz\u002Fmesh-transformer-jax\u002Fblob\u002Fmaster\u002Feval_harness.py).\n\nTo create your own custom integration you can follow instructions from [this tutorial](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002Fmain\u002Fdocs\u002Finterface.md#external-library-usage).\n\n### Additional Features\n\n> [!Note]\n> For tasks unsuitable for direct evaluation — either due risks associated with executing untrusted code or complexities in the evaluation process — the `--predict_only` flag is available to obtain decoded generations for post-hoc evaluation.\n\nIf you have a Metal compatible Mac, you can run the eval harness using the MPS back-end by replacing `--device cuda:0` with `--device mps` (requires PyTorch version 2.1 or higher). **Note that the PyTorch MPS backend is still in early stages of development, so correctness issues or unsupported operations may exist. If you observe oddities in model performance on the MPS back-end, we recommend first checking that a forward pass of your model on `--device cpu` and `--device mps` match.**\n\n> [!Note]\n> You can inspect what the LM inputs look like by running the following command:\n>\n> ```bash\n> python write_out.py \\\n>     --tasks \u003Ctask1,task2,...> \\\n>     --num_fewshot 5 \\\n>     --num_examples 10 \\\n>     --output_base_path \u002Fpath\u002Fto\u002Foutput\u002Ffolder\n> ```\n>\n> This will write out one text file for each task.\n\nTo verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:\n\n```bash\nlm_eval --model openai \\\n    --model_args engine=davinci-002 \\\n    --tasks lambada_openai,hellaswag \\\n    --check_integrity\n```\n\n## Advanced Usage Tips\n\nFor models loaded with the HuggingFace  `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI\u002Fgpt-j-6b,parallelize=True,load_in_4bit=True,peft=nomic-ai\u002Fgpt4all-j-lora \\\n    --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \\\n    --device cuda:0\n```\n\nModels provided as delta weights can be easily loaded using the Hugging Face transformers library. 
Within --model_args, set the delta argument to specify the delta weights, and use the pretrained argument to designate the relative base model to which they will be applied:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=Ejafa\u002Fllama_7B,delta=lmsys\u002Fvicuna-7b-delta-v1.1 \\\n    --tasks hellaswag\n```\n\nGPTQ quantized models can be loaded using [GPTQModel](https:\u002F\u002Fgithub.com\u002FModelCloud\u002FGPTQModel) (faster) or [AutoGPTQ](https:\u002F\u002Fgithub.com\u002FPanQiWei\u002FAutoGPTQ)\n\nGPTQModel: add `,gptqmodel=True` to `model_args`\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=model-name-or-path,gptqmodel=True \\\n    --tasks hellaswag\n```\n\nAutoGPTQ: add `,autogptq=True` to `model_args`:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \\\n    --tasks hellaswag\n```\n\nWe support wildcards in task names, for example you can run all of the machine-translated lambada tasks via `--task lambada_openai_mt_*`.\n\n## Saving & Caching Results\n\nTo save evaluation results provide an `--output_path`. We also support logging model responses with the `--log_samples` flag for post-hoc analysis.\n\n> [!TIP]\n> Use `--use_cache \u003CDIR>` to cache evaluation results and skip previously evaluated samples when resuming runs of the same (model, task) pairs. Note that caching is rank-dependent, so restart with the same GPU count if interrupted. You can also use --cache_requests to save dataset preprocessing steps for faster evaluation resumption.\n\nTo push results and samples to the Hugging Face Hub, first ensure an access token with write access is set in the `HF_TOKEN` environment variable. Then, use the `--hf_hub_log_args` flag to specify the organization, repository name, repository visibility, and whether to push results and samples to the Hub - [example dataset on the  HF Hub](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FKonradSzafer\u002Flm-eval-results-demo). 
For instance:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \\\n    --tasks hellaswag \\\n    --log_samples \\\n    --output_path results \\\n    --hf_hub_log_args hub_results_org=EleutherAI,hub_repo_name=lm-eval-results,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False \\\n```\n\nThis allows you to easily download the results and samples from the Hub, using:\n\n```python\nfrom datasets import load_dataset\n\nload_dataset(\"EleutherAI\u002Flm-eval-results-private\", \"hellaswag\", \"latest\")\n```\n\nFor a full list of supported arguments, check out the [interface](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002Fmain\u002Fdocs\u002Finterface.md) guide in our documentation!\n\n## Visualizing Results\n\nYou can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights & Biases (W&B) and Zeno.\n\n### Zeno\n\nYou can use [Zeno](https:\u002F\u002Fzenoml.com) to visualize the results of your eval harness runs.\n\nFirst, head to [hub.zenoml.com](https:\u002F\u002Fhub.zenoml.com) to create an account and get an API key [on your account page](https:\u002F\u002Fhub.zenoml.com\u002Faccount).\nAdd this key as an environment variable:\n\n```bash\nexport ZENO_API_KEY=[your api key]\n```\n\nYou'll also need to install the `lm_eval[zeno]` package extra.\n\nTo visualize the results, run the eval harness with the `log_samples` and `output_path` flags.\nWe expect `output_path` to contain multiple folders that represent individual model names.\nYou can thus run your evaluation on any number of tasks and models and upload all of the results as projects on Zeno.\n\n```bash\nlm_eval \\\n    --model hf \\\n    --model_args pretrained=EleutherAI\u002Fgpt-j-6B \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8 \\\n    --log_samples \\\n    --output_path output\u002Fgpt-j-6B\n```\n\nThen, you can upload the resulting data using the `zeno_visualize` script:\n\n```bash\npython scripts\u002Fzeno_visualize.py \\\n    --data_path output \\\n    --project_name \"Eleuther Project\"\n```\n\nThis will use all subfolders in `data_path` as different models and upload all tasks within these model folders to Zeno.\nIf you run the eval harness on multiple tasks, the `project_name` will be used as a prefix and one project will be created per task.\n\nYou can find an example of this workflow in [examples\u002Fvisualize-zeno.ipynb](examples\u002Fvisualize-zeno.ipynb).\n\n### Weights and Biases\n\nWith the [Weights and Biases](https:\u002F\u002Fwandb.ai\u002Fsite) integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.\n\nThe integration provide functionalities\n\n- to automatically log the evaluation results,\n- log the samples as W&B Tables for easy visualization,\n- log the `results.json` file as an artifact for version control,\n- log the `\u003Ctask_name>_eval_samples.json` file if the samples are logged,\n- generate a comprehensive report for analysis and visualization with all the important metric,\n- log task and cli specific configs,\n- and more out of the box like the command used to run the evaluation, GPU\u002FCPU counts, timestamp, etc.\n\nFirst you'll need to install the lm_eval[wandb] package extra. 
Do `pip install lm_eval[wandb]`.\n\nAuthenticate your machine with an your unique W&B token. Visit https:\u002F\u002Fwandb.ai\u002Fauthorize to get one. Do `wandb login` in your command line terminal.\n\nRun eval harness as usual with a `wandb_args` flag. Use this flag to provide arguments for initializing a wandb run ([wandb.init](https:\u002F\u002Fdocs.wandb.ai\u002Fref\u002Fpython\u002Finit)) as comma separated string arguments.\n\n```bash\nlm_eval \\\n    --model hf \\\n    --model_args pretrained=microsoft\u002Fphi-2,trust_remote_code=True \\\n    --tasks hellaswag,mmlu_abstract_algebra \\\n    --device cuda:0 \\\n    --batch_size 8 \\\n    --output_path output\u002Fphi-2 \\\n    --limit 10 \\\n    --wandb_args project=lm-eval-harness-integration \\\n    --log_samples\n```\n\nIn the stdout, you will find the link to the W&B run page as well as link to the generated report. You can find an example of this workflow in [examples\u002Fvisualize-wandb.ipynb](examples\u002Fvisualize-wandb.ipynb), and an example of how to integrate it beyond the CLI.\n\n## Contributing\n\nCheck out our [open issues](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fissues) and feel free to submit pull requests!\n\nFor more information on the library and how everything fits together, see our [documentation pages](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Ftree\u002Fmain\u002Fdocs).\n\nTo get started with development, first clone the repository and install the dev dependencies:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\ncd lm-evaluation-harness\npip install -e \".[dev,hf]\"\n````\n\n### Implementing new tasks\n\nTo implement a new task in the eval harness, see [this guide](.\u002Fdocs\u002Fnew_task_guide.md).\n\nIn general, we follow this priority list for addressing concerns about prompting and other eval details:\n\n1. If there is widespread agreement among people who train LLMs, use the agreed upon procedure.\n2. If there is a clear and unambiguous official implementation, use that procedure.\n3. If there is widespread agreement among people who evaluate LLMs, use the agreed upon procedure.\n4. If there are multiple common implementations but not universal or widespread agreement, use our preferred option among the common implementations. As before, prioritize choosing from among the implementations found in LLM training papers.\n\nThese are guidelines and not rules, and can be overruled in special circumstances.\n\nWe try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from [Language Models are Few Shot Learners](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.14165) as our original goal was specifically to compare results with that paper.\n\n### Support\n\nThe best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https:\u002F\u002Fdiscord.gg\u002Feleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases. 
If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!\n\n## Optional Extras\n\nExtras dependencies can be installed via `pip install -e \".[NAME]\"`\n\n### Model Backends\n\nThese extras install dependencies required to run specific model backends:\n\n| NAME           | Description                                      |\n|----------------|--------------------------------------------------|\n| hf             | HuggingFace Transformers (torch, transformers, accelerate, peft) |\n| vllm           | vLLM fast inference                              |\n| api            | API models (OpenAI, Anthropic, local servers)    |\n| gptq           | AutoGPTQ quantized models                        |\n| gptqmodel      | GPTQModel quantized models                       |\n| ibm_watsonx_ai | IBM watsonx.ai models                            |\n| ipex           | Intel IPEX backend                               |\n| habana         | Intel Gaudi backend                              |\n| optimum        | Intel OpenVINO models                            |\n| neuronx        | AWS Inferentia2 instances                        |\n| winml          | Windows ML (ONNX Runtime GenAI) - CPU\u002FGPU\u002FNPU    |\n| sparsify       | Sparsify model steering                          |\n| sae_lens       | SAELens model steering                           |\n\n### 任务依赖\n\n这些额外包会安装特定评估任务所需的依赖项：\n\n| 名称                 | 描述                    |\n|----------------------|-------------------------|\n| tasks                | 所有任务相关的依赖      |\n| acpbench             | ACP Bench 任务         |\n| audiolm_qwen         | Qwen2 音频模型         |\n| ifeval               | IFEval 任务           |\n| japanese_leaderboard | 日语 LLM 任务          |\n| longbench            | LongBench 任务         |\n| math                 | 数学答案校验           |\n| multilingual         | 多语言分词器           |\n| ruler                | RULER 任务            |\n\n### 开发与工具\n\n| 名称          | 描述                    |\n|---------------|-------------------------|\n| dev           | 代码检查与贡献          |\n| hf_transfer   | 加速 Hugging Face 下载    |\n| sentencepiece | SentencePiece 分词器    |\n| unitxt        | Unitxt 任务            |\n| wandb         | Weights & Biases 日志记录 |\n| zeno          | Zeno 结果可视化         |\n\n## 引用方式\n\n```text\n@misc{eval-harness,\n  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},\n  title        = {The Language Model Evaluation Harness},\n  month        = 07,\n  year         = 2024,\n  publisher    = {Zenodo},\n  version      = {v0.4.3},\n  doi          = {10.5281\u002Fzenodo.12608602},\n  url          = {https:\u002F\u002Fzenodo.org\u002Frecords\u002F12608602}\n}\n```","# lm-evaluation-harness 快速上手指南\n\n`lm-evaluation-harness` 是由 EleutherAI 开发的统一框架，用于在大量基准测试任务上评估生成式语言模型。它是 Hugging Face Open LLM Leaderboard 的后端引擎，支持超过 60 种学术基准测试。\n\n## 环境准备\n\n*   **操作系统**: Linux (推荐), macOS, Windows (部分功能可能受限)\n*   **Python**: 3.8 或更高版本\n*   **硬件**: \n    *   推荐使用 NVIDIA GPU (需安装 CUDA)\n    *   支持 Apple Silicon (MPS)\n    *   也可仅使用 CPU (速度较慢)\n*   **前置依赖**: `git`, `pip`\n\n> **国内加速建议**：建议使用国内镜像源加速 Python 包下载。\n> ```bash\n> export 
PIP_INDEX_URL=https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 安装步骤\n\n该工具采用模块化安装策略，基础包不包含具体的模型后端（如 `transformers` 或 `torch`），需根据需求单独安装。\n\n### 1. 克隆仓库并安装基础包\n\n```bash\ngit clone --depth 1 https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\ncd lm-evaluation-harness\npip install -e .\n```\n\n### 2. 安装模型后端\n\n根据你的模型来源选择对应的后端进行安装（可多选）：\n\n*   **Hugging Face Transformers 模型** (最常用):\n    ```bash\n    pip install \"lm_eval[hf]\"\n    ```\n\n*   **vLLM 推理加速**:\n    ```bash\n    pip install \"lm_eval[vllm]\"\n    ```\n\n*   **商业 API 模型** (OpenAI, Anthropic 等):\n    ```bash\n    pip install \"lm_eval[api]\"\n    ```\n\n*   **同时安装多个后端**:\n    ```bash\n    pip install \"lm_eval[hf,vllm,api]\"\n    ```\n\n## 基本使用\n\n### 查看可用任务\n\n在运行评估前，可以先查看支持的评测任务列表：\n\n```bash\nlm-eval ls tasks\n```\n\n### 最简单的评估示例 (Hugging Face 模型)\n\n以下命令演示如何评估 Hugging Face Hub 上的 `EleutherAI\u002Fgpt-j-6B` 模型在 `hellaswag` 任务上的表现。\n\n**前提**：已执行 `pip install \"lm_eval[hf]\"` 且机器配有 CUDA GPU。\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI\u002Fgpt-j-6B \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\n**参数说明：**\n*   `--model hf`: 指定使用 Hugging Face 后端。\n*   `--model_args pretrained=...`: 指定模型名称或本地路径。\n*   `--tasks`: 指定要运行的评测任务（多个任务可用逗号分隔，如 `lambada_openai,hellaswag`）。\n*   `--device`: 指定运行设备。\n*   `--batch_size`: 批处理大小。若设为 `auto`，系统将自动探测显存允许的最大批次大小。\n\n### 进阶提示：自动批处理大小\n\n如果不确定合适的 `batch_size`，可以使用 `auto` 模式让工具自动检测：\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI\u002Fpythia-160m \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size auto\n```","某 AI 初创团队在发布自研的 7B 参数大模型前，急需验证其在逻辑推理和常识问答等核心能力上是否达到行业基准，以便向投资人展示性能报告。\n\n### 没有 lm-evaluation-harness 时\n- **评估标准混乱**：团队需手动从 Hugging Face 下载 MMLU、GSM8K 等数据集，各自编写脚本处理格式，导致不同成员跑出的分数因预处理差异而无法横向对比。\n- **适配成本高昂**：每切换一种推理后端（如从 Transformers 切换到 vLLM 以加速测试），都要重写大量数据加载和推理代码，耗时数天且容易引入 Bug。\n- **结果不可复现**：缺乏统一的少样本（Few-shot）提示词模板和管理机制，实验记录杂乱，难以精确复现某次特定的评测配置供审计或论文发表。\n- **多维任务割裂**：想要同时评估数学、代码和多语言能力时，需要维护多套独立的测试流水线，无法一次性生成综合性能雷达图。\n\n### 使用 lm-evaluation-harness 后\n- **基准统一规范**：直接调用内置的 60+ 学术基准任务（如 Open LLM Leaderboard 系列），一键拉取标准化数据，确保评测结果与社区权威榜单完全可比。\n- **后端切换无缝**：利用其灵活的接口，仅通过修改命令行参数（如 `--model vllm`）即可在分钟级内切换推理引擎，无需改动任何业务逻辑代码。\n- **配置可追溯**：通过 YAML 配置文件统一管理提示词模板、少样本数量及后处理规则，轻松实现实验版本的精确复现和团队间共享。\n- **全景效率提升**：单条命令即可并行执行多个任务组，自动聚合生成包含准确率、困惑度等多维指标的详细报告，大幅缩短从训练到发布的验证周期。\n\nlm-evaluation-harness 将原本碎片化、高成本的模型验证工作转化为标准化、自动化的流水线，成为大模型迭代中不可或缺的“标尺”。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FEleutherAI_lm-evaluation-harness_bc9fc5f3.png","EleutherAI","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FEleutherAI_cadf9bbb.png","",null,"contact@eleuther.ai","AIEleuther","www.eleuther.ai","https:\u002F\u002Fgithub.com\u002FEleutherAI",[85,89,93],{"name":86,"color":87,"percentage":88},"Python","#3572A5",99.4,{"name":90,"color":91,"percentage":92},"Shell","#89e051",0.3,{"name":94,"color":95,"percentage":96},"C++","#f34b7d",0.2,12011,3158,"2026-04-05T10:29:46","MIT","Linux, macOS","非必需（支持 CPU\u002FAPI），若使用本地模型推荐 NVIDIA GPU（支持 CUDA）；macOS 支持 MPS；显存大小取决于模型规模（支持多卡并行及模型切分）","未说明（取决于模型大小，大模型需大量内存或启用 offload）",{"notes":105,"python":106,"dependencies":107},"基础安装包不包含 transformers 和 torch，需根据后端单独安装（如 pip install lm_eval[hf]）；支持 GGUF 格式模型评估，建议单独指定 tokenizer 路径以避免加载卡顿；支持多 GPU 数据并行及模型权重切分；支持 macOS MPS 加速。","未说明",[108,109,110,111,112,113],"torch (可选，通过 lm_eval[hf] 安装)","transformers (可选，通过 lm_eval[hf] 安装)","vllm (可选，通过 lm_eval[vllm] 安装)","accelerate (用于多 GPU 评估)","peft 
(用于适配器评估)","GPTQModel\u002FAutoGPTQ (用于量化模型)",[13,54,26],[116,117,118],"evaluation-framework","language-model","transformer","2026-03-27T02:49:30.150509","2026-04-06T07:13:02.508215",[122,127,132,137,142,146],{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},9380,"在使用 OpenAI ChatCompletions API 进行多项选择题评估时，如何解决不支持 logits（对数概率）的问题？","目前如果 API 端点不直接暴露 logits 选项，则无法直接获取基于 loglikelihood（对数似然）任务所需的信息。不过，可以通过“生成 + 精确匹配”（generation and exact match）的方式来衡量多项选择题基准。社区成员已尝试在 PR #2601 中实现此功能，适用于简单场景。若需完整支持，需等待官方更新或参考 OpenAI 的 evals 框架自行实现。","https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fissues\u002F1196",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},9381,"如何在 lm-eval 中为聊天模型（Chat Models）配置自定义的对话模板（Chat Template）？","该功能已被合并到主分支。用户可以通过命令行参数 `--chat_template` 或在 `model_args` 中传递 `chat_template=...` 来指定模板。对于自定义模板，可以参考社区 Fork 版本或使用支持聊天模板的外部库。注意：在使用 Few-shot（少样本）时，聊天模板可能会导致分数下降，而在 Zero-shot（零样本）设置下通常表现更好，特别是多项选择任务。","https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fissues\u002F1098",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},9382,"下载 pubmedqa 任务数据时失败报错，应该如何解决？","这通常是由于本地安装的 `lm-eval` 版本过旧或与当前代码库冲突导致的。解决方法是先完全卸载现有版本，然后直接从 GitHub 主分支安装最新版。具体命令如下：\n1. `pip uninstall lm-eval`\n2. `pip install git+https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness`\n安装完成后重新运行下载脚本即可成功。","https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fissues\u002F312",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},9383,"lm-eval 是否支持评估 OPT、Llama 或 Alpaca 等模型？如何配置？","支持。无需扩展模型代码，直接使用现有的 `hf-causal` 模型类型即可。只需将 `--model_args` 设置为对应的 Hugging Face 模型路径。例如评估 Llama-7B 的命令如下：\n`python main.py --model hf-causal --model_args pretrained=decapoda-research\u002Fllama-7b-hf --device cuda --tasks boolq,piqa,hellaswag,winogrande,arc_easy,arc_challenge,copa,openbookqa`\n确保模型已转换为 HF 格式或直接使用 HF Hub 上的地址。","https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fissues\u002F401",{"id":143,"question_zh":144,"answer_zh":145,"source_url":131},9384,"在使用聊天模板进行评估时，Few-shot（少样本）和 Zero-shot（零样本）的表现有何差异？","根据社区测试数据，在使用聊天模板时，5-shot（五样本）设置的得分通常会下降；而在 0-shot（零样本）设置下表现更好，尤其是对于 `multiple_choice`（多项选择）类任务。因此，建议在评估聊天模型时，优先尝试 zero-shot 设置以获得更稳定的结果，或者对比有无模板在不同 shot 设置下的差异。",{"id":147,"question_zh":148,"answer_zh":149,"source_url":141},9385,"如果在 TriviaQA 数据集上的评估结果为 0，可能是什么原因？","虽然 Issue 中未给出最终定论，但此类问题通常与答案提取逻辑（answer extraction）或提示词格式（prompt formatting）不匹配有关。建议检查模型输出格式是否符合 TriviaQA 的预期答案格式，或者确认是否使用了正确的 few-shot 示例。如果是使用生成式任务，可能需要调整答案解析的正则表达式或后处理逻辑。",[151,156,161,166,171,176,181,186,191,196,201,206,211,216,221,226,231],{"id":152,"version":153,"summary_zh":154,"released_at":155},106745,"v0.4.11","## v0.4.11 Release Notes\r\n\r\nMinor release. Stay tuned for bigger changes next release.\r\n\r\n### New Platform Support\r\n\r\n* **Windows ML Backend** — Native Windows ML inference support by @chapsiru and @chemwolf6922 in #3470, #3564, #3565\r\n\r\n### New Benchmarks & Tasks\r\n\r\n* **BEAR** knowledge probe by @plonerma in #3496\r\n\r\n### Task Version Changes\r\n\r\n> The following tasks have updated versions. Results from a previous task versions may not be directly comparable. 
See the linked PRs or individual task READMEs for changelogs.\r\n\r\n`afrobench_belebele` (all variants): 2 → 3 in #3551\r\n`evalita_llm`: 0.0 → 0.1 in #3551\r\n`include` (all 90 language variants): 0.0 → 0.1 in #3551\r\n`mgsm_direct` (all 11 language variants): 3.0 → 4.0 by @LakshyaChaudhry in #3574\r\n\r\n### Fixes & Improvements\r\n\r\n* Fixed **SQuAD v2** evaluation by @HydrogenSulfate in #3535\r\n* Fixed **MasakhaNEWS** tasks — replaced non-existent `headline_text` field with `headline` by @Mr-Neutr0n in #3567\r\n* Fixed incorrect task configs by @baberabb in #3552\r\n* Replaced `eval()` with `ast.literal_eval` in task configs for safer parsing by @baberabb in #3577\r\n* Fixed **SGLang** duplicate registration error by @enpimashin in #3543\r\n* Restored **`hf_transfer`** import check by @baberabb in #3563\r\n* Fixed `modify_gen_kwargs` call in **vLLM VLMs** by @hmellor in #3573\r\n* Refactored **vLLM** `gen_kwargs` normalization inline to `modify_gen_kwargs`; fixed cached `gen_kwargs` mutation by @baberabb in #3582\r\n* Fixed README for task-listing CLI command by @UltimateJupiter in #3545\r\n* Updated dependencies by @baberabb in #3546\r\n\r\n## New Contributors\r\n* @HydrogenSulfate made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3535\r\n* @UltimateJupiter made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3545\r\n* @enpimashin made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3543\r\n* @chapsiru made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3470\r\n* @chemwolf6922 made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3565\r\n* @plonerma made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3496\r\n* @hmellor made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3573\r\n* @Mr-Neutr0n made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3567\r\n* @LakshyaChaudhry made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3574\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fcompare\u002Fv0.4.10...v0.4.11","2026-02-13T20:21:47",{"id":157,"version":158,"summary_zh":159,"released_at":160},106746,"v0.4.10","## Highlights\r\n\r\nThe big change this release: the base package no longer installs model backends by default. We've also added new benchmarks and expanded multilingual support.\r\n\r\n### Breaking Change: Lightweight Core with Optional Backends\r\n\r\n**`pip install lm_eval` no longer installs the HuggingFace\u002Ftorch stack by default.** (#3428)\r\n\r\nThe core package no longer includes backends. 
Install them explicitly:\r\n\r\n```bash\r\npip install lm_eval          # core only, no model backends\r\npip install lm_eval[hf]      # HuggingFace backend (transformers, torch, accelerate)\r\npip install lm_eval[vllm]    # vLLM backend\r\npip install lm_eval[api]     # API backends (OpenAI, Anthropic, etc.)\r\n```\r\n\r\n**Additional breaking change:** Accessing model classes via attribute no longer works:\r\n\r\n```python\r\n# This still works:\r\nfrom lm_eval.models.huggingface import HFLM\r\n\r\n# This now raises AttributeError:\r\nimport lm_eval.models\r\nlm_eval.models.huggingface.HFLM\r\n```\r\n\r\n### CLI Refactor\r\n\r\nThe CLI now uses explicit subcommands and supports YAML config files (#3440):\r\n\r\n```bash\r\nlm-eval run --model hf --tasks hellaswag      # run evaluations\r\nlm-eval run --config my_config.yaml           # load args from YAML config\r\nlm-eval ls tasks                               # list available tasks\r\nlm-eval validate --tasks hellaswag,arc_easy   # validate task configs\r\n```\r\n\r\nBackward compatible when omitting `run` still works: `lm-eval --model hf --tasks hellaswag`\r\n\r\nSee `lm-eval --help` or the [CLI documentation](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002Fmain\u002Fdocs\u002Finterface.md) for details.\r\n\r\n### Other Improvements\r\n\r\n* **Decoupled `ContextSampler`** with new `build_qa_turn` helper (#3429)\r\n* **Normalized `gen_kwargs`** with `truncation_side` support for vLLM (#3509)\r\n\r\n## New Benchmarks & Tasks\r\n\r\n* **PISA** task by @HallerPatrick in #3412\r\n* **SLR-Bench** (Scalable Logical Reasoning Benchmark) by @Ahmad21Omar in #3305\r\n* **OpenAI Multilingual MMLU** by @Helw150 in #3473\r\n* **ULQA** benchmark by @keramjan in #3340\r\n* **IFEval** in Spanish and Catalan by @juliafalcao in #3467\r\n* **TruthfulQA-VA** for Catalan by @sgs97ua in #3469\r\n* **Multiple Bangla benchmarks** by @Ismail-Hossain-1 in #3454\r\n* **NeurIPS E2LM Competition submissions**: Team Shaikespear, Morai, and Noor by @younesbelkada in #3437, #3443, #3444\r\n\r\n## Model Support\r\n\r\n* **Ministral-3** adapter (`hf-mistral3`) by @medhakimbedhief in #3487\r\n\r\n## Fixes & Improvements\r\n\r\n### Task Fixes\r\n\r\n* Fixed leading whitespace leakage in **MMLU-Pro** by @baberabb in #3500\r\n* Fixed `gen_prefix` delimiter handling in multiple-choice tasks by @baberabb in #3508\r\n* Fixed MGSM stop criteria in Iberian languages by @juliafalcao in #3465\r\n* Fixed `a=0` as valid answer index in `build_qa_turn` by @ezylopx5 in #3488\r\n* Fixed `fewshot_config` not being applied to fewshot docs by @baberabb in #3461\r\n* Updated GSM8K, WinoGrande, and SuperGLUE to use full HF dataset paths by @baberabb in #3523, #3525, #3527\r\n* Fixed `gsm8k_cot_llama` `target_delimiter` issue by @baberabb in #3526\r\n* Updated LIBRA task utils by @bond005 in #3520\r\n\r\n### Backend Fixes\r\n\r\n* Fixed vLLM off-by-one `max_length` error by @baberabb in #3503\r\n* Resolved deprecated `vllm.transformers_utils.get_tokenizer` import by @DarkLight1337 in #3482\r\n* Fixed SGLang import and removed duplicate tasks by @baberabb in #3492\r\n* Removed deprecated `AutoModelForVision2Seq` by @baberabb in #3522\r\n* Fixed Anthropic chat model mapping by @lucafossen in #3453\r\n* Fixed bug preventing `=` sign in checkpoint names by @mrinaldi97 in #3517\r\n* Fixed `pretty_print_task` for external custom configs by @safikhanSoofiyani in #3436\r\n* Fixed CLI regressions by @fxmarty-amd in #3449\r\n\r\n## New Contributors\r\n* 
@safikhanSoofiyani made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3436\r\n* @lucafossen made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3453\r\n* @Ahmad21Omar made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3305\r\n* @ezylopx5 made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3488\r\n* @juliafalcao made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3467\r\n* @medhakimbedhief made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3487\r\n* @ntenenz made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3489\r\n* @keramjan made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3340\r\n* @bond005 made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3520\r\n* @mrinaldi97 made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3517\r\n* @wogns3623 made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3523\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-eval","2026-01-27T19:56:53",{"id":162,"version":163,"summary_zh":164,"released_at":165},106747,"v0.4.9.2","This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. 
A notable change: **Python 3.10 is now the minimum required version**.\r\n\r\n### New Benchmarks & Tasks\r\n\r\nA big wave of new evaluation tasks this release:\r\n\r\n* **AIME** and **MATH500** math reasoning benchmarks by @jannalulu in #3248, #3311\r\n* **BabiLong** and **Longbench v2** for long-context evaluation by @jannalulu in #3287, #3338\r\n* **GraphWalks** by @jannalulu in #3377\r\n* **ZhoBLiMP**, **BLiMP-NL**, **TurBLiMP**, **LM-SynEval**, and **BHS** linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265\r\n* **Icelandic WinoGrande** by @jmichaelov in #3277\r\n* **CLIcK** Korean benchmark by @shing100 in #3173\r\n* **MMLU-Redux** (generative) and Spanish translation by @luiscosio in #2705\r\n* **EsBBQ** and **CaBBQ** bias benchmarks by @valleruizf in #3167\r\n* **EQBench** in Spanish and Catalan by @priverabsc in #3168\r\n* **Anthropic discrim-eval** by @Helw150 in #3091\r\n* **XNLI-VA** by @FranValero97 in #3194\r\n* **Bangla MMLU** (Titulm) by @Ismail-Hossain-1 in #3317\r\n* **HumanEval infilling** by @its-alpesh in #3299\r\n* **CNN-DailyMail 3.0.0** by @preordinary in #3426\r\n* **Global PIQA** and new `acc_norm_bytes` metric by @baberabb in #3368\r\n\r\n### Fixes & Improvements\r\n\r\n**Core Changes:**\r\n* **Python 3.10 minimum** by @jannalulu in #3337\r\n* **Unpinned `datasets`** library by @baberabb in #3316\r\n* **BOS token handling**: Delegate to tokenizer; `add_bos_token` now defaults to `None` by @baberabb in #3347\r\n* Renamed `LOGLEVEL` env var to `LMEVAL_LOG_LEVEL` to avoid conflicts by @fxmarty-amd in #3418\r\n* Resolve duplicate task names with safeguards by @giuliolovisotto in #3394\r\n\r\n**Task Fixes:**\r\n* Fixed MMLU-Redux to exclude samples without `error_type=\"ok\"` and display summary table by @fxmarty-amd in #3410, #3406\r\n* Fixed AIME answer extraction by @jannalulu in #3353\r\n* Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361\r\n* Fixed `crows_pairs` dataset by @jannalulu in #3378\r\n* Fixed Gemma tokenizer `add_bos_token` not updating by @DarkLight1337 in #3206\r\n* Fixed `lambada_multilingual_stablelm` by @jmichaelov, @HallerPatrick in #3294, #3222\r\n* Fixed CodeXGLUE by @gsaltintas in #3238\r\n* Pinned correct MMLUSR version by @christinaexyou in #3350\r\n* Updated `minerva_math` by @baberabb in #3259\r\n\r\n**Backend Fixes:**\r\n* Fixed vLLM import errors when not installed by @fxmarty-amd in #3292\r\n* Fixed vLLM `data_parallel_size>1` issue by @Dornavineeth in #3303\r\n* Resolved deprecated `vllm.utils.get_open_port` by @DarkLight1337 in #3398\r\n* Fixed GPT series model bugs by @zinccat in #3348\r\n* Fixed PIL image hashing to use actual bytes by @tboerstad in #3331\r\n* Fixed `additional_config` parsing by @brian-dellabetta in #3393\r\n* Fixed batch chunking seed handling with groupby by @slimfrkha in #3047\r\n* Fixed no-output error handling by @Oseltamivir in #3395\r\n* Replaced deprecated `torch_dtype` with `dtype` by @AbdulmalikDS in #3415\r\n* Fixed custom task config reading by @SkyR0ver in #3425\r\n\r\n### Model & Backend Support\r\n\r\n* **OpenAI GPT-5** support by @babyplutokurt in #3247\r\n* **Azure OpenAI** support by @zinccat in #3349\r\n* **Fine-tuned Gemma3** evaluation support by @LearnerSXH in #3234\r\n* **OpenVINO text2text** models by @nikita-savelyevv in #3101\r\n* **Intel XPU** support for HFLM by @kaixuanliu in #3211\r\n* **Attention head steering** support by @luciaquirke in #3279\r\n* Leverage vLLM's `tokenizer_info` endpoint to avoid manual duplication 
by @m-misiura in #3185\r\n\r\n## What's Changed\r\n* Remove `trust_remote_code: True` from updated datasets by @Avelina9X in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3213\r\n* Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3234\r\n* Fix `add_bos_token` not updated for Gemma tokenizer by @DarkLight1337 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3206\r\n* remove incomplete compilation instructions, solves #3233 by @ceferisbarov in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3242\r\n* Update utils.py by @Anri-Lombard in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3246\r\n* Adding support for OpenAI GPT-5 model by @babyplutokurt in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3247\r\n* Add xnli_va dataset by @FranValero97 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3194\r\n* Add ZhoBLiMP benchmark by @jmichaelov in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3218\r\n* Add BLiMP-NL by @jmichaelov in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3221\r\n* Add TurBLiMP by @jmichaelov in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3219\r\n* Add LM-SynEval Benchmark by @jmichaelov in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3184\r\n* Fix unknown group key to tag in yaml config for `lambada_multilingual_stablelm` by @HallerPat","2025-11-26T23:27:06",{"id":167,"version":168,"summary_zh":169,"released_at":170},106748,"v0.4.9.1","# lm-eval v0.4.9.1 Release Notes\r\n\r\nThis v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking aheas, we're gearing up for some bigger updates to tackle common community pain points. We'll do our best to keep things from breaking, but we anticipate a few changes might not be fully backward-compatible. We're excited to share more soon!\r\n\r\n### Enhanced Reasoning Model Handling\r\n* Better support for reasoning models with a `think_end_token` argument to strip intermediate reasoning from outputs for the `hf`, `vllm`, and `sglang` model backends. A related `enable_thinking` argument was also added for specific models that support it (e.g., Qwen).\r\n\r\n## New Benchmarks & Tasks\r\n* EgyMMLU and EgyHellaSwag by @houdaipha in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3063\r\n* MultiBLiMP benchmark by @jmichaelov in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3155\r\n* LIBRA benchmark for long-context evaluation by @karimovaSvetlana in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2943\r\n* Multilingual Truthfulqa in Spanish, Basque and Galician by @BlancaCalvo in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3062\r\n\r\n## Fixes & Improvements\r\n### Tasks & Benchmarks:\r\n* Aligned Humaneval results for Llama-3.1-70B-Instruct with official scores by @userljz, @baberabb, @idantene in (https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3201. 
[#3092](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3092), [#3102](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3102))\r\n* Fixed incorrect dataset paths for GLUE and medical benchmarks by @Avelina9X and @idantene. ([#3159](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3159), [#3151](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3151))\r\n* Removed redundant \"Let's think step by step\" text from `bbh_cot_fewshot` prompts by @philipdoldo. ([#3140](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3140))\r\n* Increased `max_gen_toks` to 2048 for HRM8K math benchmarks by @shing100. ([#3124](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3124))\r\n\r\n### Backend & Stability:\r\n* Reduce CLI loading time from 2.2s to 0.05s by @stakodiak. ([#3099](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3099))\r\n* Fixed a process hang caused by mp.Pool in bootstrap_stderr and introduced `DISABLE_MULTIPROC` envar by @ankitgola005 and @neel04. ([#3135](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3135), [#3106](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3106))\r\n* add image hashing and `LMEVAL_HASHMM` envar by @artemorloff in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2973\r\n* TaskManager: `include-path` precedence handling to prioritize custom dir over default by @parkhs21 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3068\r\n\r\n## Housekeeping:\r\n* Pinned `datasets \u003C 4.0.0` temporarily to maintain compatibility with `trust_remote_code` by @baberabb. ([#3172](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3172))\r\n* Removed models from Neural Magic and other unneeded files by @baberabb. 
([#3112](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3112), [#3113](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3113), [#3108](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3108))\r\n\r\n## What's Changed\r\n* llama3 task: update README.md by @annafontanaa in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3074\r\n* Fix Anthropic API compatibility issues in chat completions by @NourFahmy in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3054\r\n* Ensure backwards compatibility in `fewshot_context` by using kwargs by @kiersten-stokes in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3079\r\n* [vllm] remove system message if `TemplateError` for chat_template by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3076\r\n* feat \u002F fix: Properly make use of `subfolder` from HF models by @younesbelkada in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3072\r\n* [HF] fix quantization config by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3039\r\n* FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct by @userljz in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3092\r\n* Truthfulqa multi harness by @BlancaCalvo in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3062\r\n* Fix: Reduce CLI loading time from 2.2s to 0.05s by @stakodiak in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3099\r\n* Humaneval - fix regression by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3102\r\n* Bugfix\u002Fhf tokenizer gguf override by @ankush13r in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3098\r\n* [FIX] Initial code to disable multi-proc for stderr by @neel04 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F3106\r\n* fix deps; update hooks by @baberabb in https:\u002F\u002Fgithub.com\u002FEl","2025-08-04T11:36:05",{"id":172,"version":173,"summary_zh":174,"released_at":175},106749,"v0.4.9","# lm-eval v0.4.9 Release Notes\r\n\r\n## Key Improvements\r\n\r\n* **Enhanced Backend Support**:\r\n   * **SGLang Generate API** by **@baberabb** in #2997\r\n   * **vLLM enhancements**: Added support for `enable_thinking` argument (#2947) and data parallel for V1 (#3011) by **@anmarques** and **@baberabb**\r\n   * **Chat template improvements**: Extended vLLM chat template support (#2902) and fixed HF chat template resolution (#2992) by **@anmarques** and **@fxmarty-amd**\r\n\r\n* **Multimodal Capabilities**:\r\n   * **Audio modality support** for Qwen2 Audio models by **@artemorloff** in #2689\r\n   * **Image processing improvements**: Added resize images support (#2958) and enabled multimodal API usage (#2981) by **@artemorloff** and **@baberabb**\r\n   * **ChartQA** multimodal task implementation by **@baberabb** in #2544\r\n\r\n* **Performance & Reliability**:\r\n   * **Quantization support** added via `quantization_config` by **@jerryzh168** in #2842\r\n   * **Memory optimization**: Use `yaml.CLoader` for faster YAML loading by **@giuliolovisotto** in 
#2777\r\n   * **Bug fixes**: Resolved MMLU generative metric aggregation (#2761) and context length handling issues (#2972)\r\n\r\n## New Benchmarks & Tasks\r\n\r\n### **Code Evaluation**\r\n* **HumanEval Instruct** - Instruction-following code generation benchmark by **@baberabb** in #2650\r\n* **MBPP Instruct** - Instruction-based Python programming evaluation by **@baberabb** in #2995\r\n\r\n### **Language Modeling**\r\n* **C4 Dataset Support** - Added perplexity evaluation on C4 web crawl dataset by **@Zephyr271828** in #2889\r\n\r\n### **Long Context Benchmarks**\r\n* **RULER and Longbench** - Long-context evaluation suites added by **@baberabb** in #2629\r\n\r\n### **Mathematical & Reasoning**\r\n* **GSM8K Platinum** - Enhanced mathematical reasoning benchmark by **@Qubitium** in #2771\r\n* **MastermindEval** - Logic reasoning evaluation by **@whoisjones** in #2788\r\n* **JSONSchemaBench** - Structured output evaluation by **@Saibo-creator** in #2865\r\n\r\n### **Llama Reference Implementations**\r\n* **Llama Reference Implementations** - Added task variants for Multilingual MMLU, MMLU CoT, GSM8K, and ARC Challenge based on Llama evaluation standards by **@anmarques** in #2797, #2826, #2829\r\n\r\n### **Multilingual Expansion**\r\n\r\n**Asian Languages**:\r\n* **Korean MMLU (KMMLU)** multiple-choice task by **@Aprilistic** in #2849\r\n* **MMLU-ProX** extended evaluation by **@heli-qi** in #2811\r\n* **KBL 2025 Dataset** - Updated Korean benchmark evaluation by **@abzb1** in #3000\r\n\r\n**European Languages**:\r\n* **NorEval** - Comprehensive Norwegian benchmark by **@vmkhlv** in #2919\r\n\r\n**African Languages**:\r\n* **AfroBench** - Multi-African language evaluation by **@JessicaOjo** in #2825\r\n* **Darija tasks** - Moroccan dialect benchmarks (DarijaMMLU, DarijaHellaSwag, Darija_Bench) by **@hadi-abdine** in #2521\r\n\r\n**Arabic Languages**:\r\n* **Arab Culture** task for cultural understanding by **@bodasadallah** in #3006\r\n\r\n### **Domain-Specific Benchmarks**\r\n* **CareQA** - Healthcare evaluation benchmark by **@PabloAgustin** in #2714\r\n* **ACPBench & ACPBench Hard** - Automated code generation evaluation by **@harshakokel** in #2807, #2980\r\n* **INCLUDE tasks** - Inclusivity evaluation suite by **@agromanou** in #2769\r\n* **Cocoteros VA** dataset by **@sgs97ua** in #2787\r\n\r\n### **Social & Bias Evaluation**\r\n* **Various social bias tasks** for fairness assessment by **@oskarvanderwal** in #1185\r\n\r\n## Technical Enhancements\r\n\r\n* **Fine-grained evaluation**: Added `--examples` argument for efficient multi-prompt evaluation by **@felipemaiapolo** and **@mirianfsilva** in #2520\r\n* **Improved tokenization**: Better handling of `add_bos_token` initialization by **@baberabb** in #2781\r\n* **Memory management**: Enhanced softmax computations with `softmax_dtype` argument for `HFLM` by **@Avelina9X** in #2921\r\n\r\n## Critical Bug Fixes\r\n\r\n* **Collating Queries Fix** - Resolved error with different continuation lengths that was causing evaluation failures by **@ameyagodbole** in #2987\r\n* **Mutual Information Metric** - Fixed acc_mutual_info calculation bug that affected metric accuracy by **@baberabb** in #3035\r\n\r\n## Breaking Changes & Important Updates\r\n\r\n* **MMLU dataset migration**: Switched to `cais\u002Fmmlu` dataset source by **@baberabb** in #2918\r\n* **Default parameter updates**: Increased `max_gen_toks` to 2048 and `max_length` to 8192 for MMLU Pro tests by **@dazipe** in #2824\r\n* **Temperature defaults**: Set default temperature 
to 0.0 for vLLM and SGLang backends by **@baberabb** in #2819\r\n\r\nWe extend our heartfelt thanks to all contributors who made this release possible, including **43 first-time contributors** who brought fresh perspectives and valuable improvements to the evaluation harness.\r\n\r\n## What's Changed\r\n* fix mmlu (generative) metric aggregation by @wangcho2k in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2761\r\n* Bugfix by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2762\r\n* fix verbosity typo by @baberab","2025-06-19T14:18:27",{"id":177,"version":178,"summary_zh":179,"released_at":180},106750,"v0.4.8","# lm-eval v0.4.8 Release Notes\r\n\r\n## Key Improvements\r\n\r\n* **New Backend Support**: \r\n  * Added SGLang as new evaluation backend! by @Monstertail\r\n  * Enabled model steering with vector support via `sparsify` or `sae_lens` by @luciaquirke and @AMindToThink \r\n\r\n* **Breaking Change**: Python 3.8 support has been dropped as it reached end of life. Please upgrade to Python 3.9 or newer.\r\n* **Added Support for `gen_prefix`** in config, allowing you to append text after the \u003C|assistant|> token (or at the end of non-chat prompts) - particularly effective for evaluating instruct models\r\n\r\n## New Benchmarks & Tasks\r\n\r\n### Code Evaluation\r\n* HumanEval by @hjlee1371 in #1992\r\n* MBPP by @hjlee1371 in #2247\r\n* HumanEval+ and MBPP+ by @bzantium in #2734\r\n\r\n### Multilingual Expansion\r\n* **Global Coverage**:\r\n  * Global MMLU (Lite version by @shivalika-singh in #2567, Full version by @bzantium in #2636)\r\n  * MLQA multilingual question answering by @KahnSvaer in #2622\r\n\r\n* **Asian Languages**:\r\n  * HRM8K benchmark for Korean and English by @bzantium in #2627\r\n  * Updated KorMedMCQA to version 2.0 by @GyoukChu in #2540\r\n  * Fixed TMLU Taiwan-specific tasks tag by @nike00811 in #2420\r\n\r\n* **European Languages**:\r\n  * Added Evalita-LLM benchmark by @m-resta in #2681\r\n  * BasqueBench with Basque translations of ARC and PAWS by @naiarapm in #2732\r\n  * Updated Turkish MMLU configuration by @ArdaYueksel in #2678\r\n\r\n* **Middle Eastern Languages**:\r\n  * Arabic MMLU by @bodasadallah in #2541\r\n  * AraDICE task by @firojalam in #2507\r\n\r\n### Ethics & Reasoning\r\n* Moral Stories by @upunaprosk in #2653\r\n* Histoires Morales by @upunaprosk in #2662\r\n\r\n### Others\r\n* MMLU Pro Plus by @asgsaeid in #2366\r\n* GroundCocoa by @HarshKohli in #2724\r\n\r\nWe extend our thanks to all contributors who made this release possible and to our users for your continued support and feedback.\r\n\r\nThanks, the LM Eval Harness team (@baberabb and @lintangsutawika)\r\n\r\n## What's Changed\r\n* drop python 3.8 support by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2575\r\n* Add Global MMLU Lite by @shivalika-singh in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2567\r\n* add warning for truncation by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2585\r\n* Wandb step handling bugfix and feature by @sjmielke in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2580\r\n* AraDICE task config file by @firojalam in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2507\r\n* fix extra_match low if batch_size > 1 by @sywangyi in 
https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2595\r\n* fix model tests by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2604\r\n* update scrolls by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2602\r\n* some minor logging nits by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2609\r\n* Fix gguf loading via Transformers by @CL-ModelCloud in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2596\r\n* Fix Zeno visualizer on tasks like GSM8k by @pasky in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2599\r\n* Fix the format of mgsm zh and ja. by @timturing in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2587\r\n* Add HumanEval by @hjlee1371 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1992\r\n* Add MBPP by @hjlee1371 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2247\r\n* Add MLQA by @KahnSvaer in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2622\r\n* assistant prefill  by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2615\r\n* fix gen_prefix by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2630\r\n* update pre-commit by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2632\r\n* add hrm8k benchmark for both Korean and English by @bzantium in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2627\r\n* New arabicmmlu by @bodasadallah in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2541\r\n* Add `global_mmlu` full version by @bzantium in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2636\r\n* Update KorMedMCQA: ver 2.0 by @GyoukChu in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2540\r\n* fix tmlu tmlu_taiwan_specific_tasks tag by @nike00811 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2420\r\n* fixed mmlu generative response extraction by @RawthiL in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2503\r\n* revise mbpp prompt by @bzantium in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2645\r\n* aggregate by group (total and categories) by @bzantium in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2643\r\n* Fix max_tokens handling in vllm_vlms.py by @jkaniecki in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2637\r\n* separate category for `global_mmlu` by @bzantium in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull","2025-03-05T07:49:46",{"id":182,"version":183,"summary_zh":184,"released_at":185},106751,"v0.4.7","# lm-eval v0.4.7 Release Notes\r\n\r\nThis release includes several bug fixes, minor improvements to model handling, and task additions.\r\n\r\n## ⚠️ Python 3.8 End of Support Notice\r\nPython 3.8 support will be dropped in future releases as it has reached its end of life. 
Users are encouraged to upgrade to Python 3.9 or newer.\r\n\r\n## Backwards Incompatibilities\r\n\r\n### Chat Template Delimiter Handling (in v0.4.6)\r\n\r\nAn important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.\r\n\r\n📝 For detailed documentation, please refer to [docs\u002Fchat-template-readme.md](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002Fmain\u002Fdocs\u002Fchat-template-readme.md)\r\n\r\n## New Benchmarks & Tasks\r\n\r\n- Basque Integration: Added Basque translation of PIQA (piqa_eu) to BasqueBench by @naiarapm in #2531\r\n- SCORE Tasks: Added new subtask for non-greedy robustness evaluation by @rimashahbazyan in #2558\r\n\r\nAs well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).\r\n\r\nThanks, the LM Eval Harness team (@baberabb and @lintangsutawika)\r\n\r\n\r\n## What's Changed\r\n* Score tasks by @rimashahbazyan in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2452\r\n* Filters bugfix; add `metrics` and `filter` to logged sample by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2517\r\n* skip casting if predict_only by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2524\r\n* make utility function to handle `until` by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2518\r\n* Update Unitxt task to  use locally installed unitxt and not download Unitxt code from Huggingface by @yoavkatz in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2514\r\n* add Basque translation of PIQA (piqa_eu) to BasqueBench by @naiarapm in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2531\r\n* avoid timeout errors with high concurrency in api_model by @dtrawins in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2307\r\n* Update README.md by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2534\r\n* better doc_to_test testing by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2535\r\n* Support pipeline parallel with OpenVINO models by @sstrehlk in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2349\r\n* Super little tiny fix doc by @fzyzcjy in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2546\r\n* [API] left truncate for generate_until by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2554\r\n* Update Lightning import by @maanug-nv in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2549\r\n* add optimum-intel ipex model by @yao-matrix in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2566\r\n* add warning to readme by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2568\r\n* Adding new subtask to SCORE tasks: non greedy robustness by @rimashahbazyan in 
https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2558\r\n* batch `loglikelihood_rolling` across requests by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2559\r\n* fix `DeprecationWarning: invalid escape sequence '\\s'` for whitespace filter by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2560\r\n* increment version to 4.6.7 by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2574\r\n\r\n## New Contributors\r\n* @rimashahbazyan made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2452\r\n* @naiarapm made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2531\r\n* @dtrawins made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2307\r\n* @sstrehlk made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2349\r\n* @fzyzcjy made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2546\r\n* @maanug-nv made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2549\r\n* @yao-matrix made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2566\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fcompare\u002Fv0.4.6...v0.4.7","2024-12-17T10:37:09",{"id":187,"version":188,"summary_zh":189,"released_at":190},106752,"v0.4.6","# lm-eval v0.4.6 Release Notes\r\n\r\nThis release brings important changes to chat template handling, expands our task library with new multilingual and multimodal benchmarks, and includes various bug fixes.\r\n\r\n## Backwards Incompatibilities\r\n\r\n### Chat Template Delimiter Handling\r\n\r\nAn important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. 
This change ensures better compatibility with chat models by respecting their native formatting conventions.\r\n\r\n📝 For detailed documentation, please refer to [docs\u002Fchat-template-readme.md](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002Fmain\u002Fdocs\u002Fchat-template-readme.md)\r\n\r\n## New Benchmarks & Tasks\r\n\r\n### Multilingual Expansion\r\n- **Spanish Bench**: Enhanced benchmark with additional tasks by @zxcvuser in #2390\r\n- **Japanese Leaderboard**: New comprehensive Japanese language benchmark by @sitfoxfly in #2439\r\n\r\n### New Task Collections\r\n- **Multimodal Unitext**: Added support for multimodal tasks available in unitext by @elronbandel in #2364\r\n- **Metabench**: New benchmark contributed by @kozzy97 in #2357\r\n\r\nAs well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).\r\n\r\nThanks, the LM Eval Harness team (@baberabb and @lintangsutawika)\r\n\r\n\r\n\r\n## What's Changed\r\n* Add Unitxt Multimodality Support by @elronbandel in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2364\r\n* Add new tasks to spanish_bench and fix duplicates by @zxcvuser in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2390\r\n* fix typo bug for minerva_math by @renjie-ranger in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2404\r\n* Fix: Turkish MMLU Regex Pattern by @ArdaYueksel in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2393\r\n* fix storycloze datanames by @t1101675 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2409\r\n* Update NoticIA prompt by @ikergarcia1996 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2421\r\n* [Fix] Replace generic exception classes with a more specific ones by @LSinev in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1989\r\n* Support for IBM watsonx_llm by @Medokins in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2397\r\n* Fix package extras for watsonx support by @kiersten-stokes in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2426\r\n* Fix lora requests when dp with vllm by @ckgresla in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2433\r\n* Add xquad task by @zxcvuser in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2435\r\n* Add verify_certificate argument to local-completion by @sjmonson in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2440\r\n* Add GPTQModel support for evaluating GPTQ models by @Qubitium in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2217\r\n* Add missing task links by @Sypherd in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2449\r\n* Update CODEOWNERS by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2453\r\n* Add real process_docs example by @Sypherd in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2456\r\n* Modify label errors in catcola and paws-x by @zxcvuser in 
https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2434\r\n* Add Japanese Leaderboard by @sitfoxfly in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2439\r\n* Typos: Fix 'loglikelihood' misspellings in api_models.py by @RobGeada in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2459\r\n* use global `multi_choice_filter` for mmlu_flan by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2461\r\n* typo by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2465\r\n* pass device_map other than auto for parallelize by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2457\r\n* OpenAI ChatCompletions: switch `max_tokens` by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2443\r\n* Ifeval: Dowload `punkt_tab` on rank 0 by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2267\r\n* Fix chat template; fix leaderboard math by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2475\r\n* change warning to debug by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2481\r\n* Updated wandb logger to use `new_printer()` instead of `get_printer(...)` by @alex-titterton in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2484\r\n* IBM watsonx_llm fixes & refactor by @Medokins in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2464\r\n* Fix revision parameter to vllm get_tokenizer by @OyvindTafjord in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2492\r\n* update pre-commit hooks and git actions by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2497\r\n* kbl-v0.1.1 by @whwang299 in https:\u002F\u002Fgi","2024-11-25T13:38:29",{"id":192,"version":193,"summary_zh":194,"released_at":195},106753,"v0.4.5","# lm-eval v0.4.5 Release Notes\r\n\r\n## New Additions\r\n\r\n### Prototype Support for Vision Language Models (VLMs)\r\n\r\nWe're excited to introduce prototype support for Vision Language Models (VLMs) in this release, using model types `hf-multimodal` and `vllm-vlm`.  This allows for evaluation of models that can process text and image inputs and produce text outputs. Currently we have added support for the MMMU (`mmmu_val`) task and we welcome contributions and feedback from the community!\r\n\r\n#### New VLM-Specific Arguments\r\n\r\nVLM models can be configured with several new arguments within `--model_args` to support their specific requirements:\r\n\r\n- `max_images` (int): Set the maximum number of images for each prompt.\r\n- `interleave` (bool): Determines the positioning of image inputs. When `True` (default) images are interleaved with the text. When `False` all images are placed at the front of the text. This is model dependent.\r\n\r\n`hf-multimodal` specific args:\r\n- `image_token_id` (int) or `image_string` (str): Specifies a custom token or string for image placeholders. For example, Llava models expect an `\"\u003Cimage>\"` string to indicate the location of images in the input, while Qwen2-VL models expect an `\"\u003C|image_pad|>\"` sentinel string instead. 
This will be inferred based on model configuration files whenever possible, but we recommend confirming that an override is needed when testing a new model family\r\n- `convert_img_format` (bool): Whether to convert the images to RGB format.\r\n\r\n#### Example usage:\r\n\r\n- `lm_eval --model hf-multimodal --model_args pretrained=llava-hf\u002Fllava-1.5-7b-hf,attn_implementation=flash_attention_2,max_images=1,interleave=True,image_string=\u003Cimage> --tasks mmmu_val --apply_chat_template`\r\n\r\n- `lm_eval --model vllm-vlm --model_args pretrained=llava-hf\u002Fllava-1.5-7b-hf,max_images=1,interleave=True --tasks mmmu_val --apply_chat_template`\r\n\r\n#### Important considerations\r\n\r\n1. **Chat Template**: Most VLMs require the `--apply_chat_template` flag to ensure proper input formatting according to the model's expected chat template.\r\n2. Some VLM models are limited to processing a single image per prompt. For these models, always set `max_images=1`. Additionally, certain models expect image placeholders to be non-interleaved with the text, requiring `interleave=False`.\r\n3. Performance and Compatibility: When working with VLMs, be mindful of potential memory constraints and processing times, especially when handling multiple images or complex tasks.\r\n\r\n#### Tested VLM Models\r\n\r\nWe have currently most notably tested the implementation with the following models:\r\n\r\n- llava-hf\u002Fllava-1.5-7b-hf\r\n- llava-hf\u002Fllava-v1.6-mistral-7b-hf\r\n- Qwen\u002FQwen2-VL-2B-Instruct\r\n- HuggingFaceM4\u002Fidefics2 (requires the latest `transformers` from source)\r\n\r\n\r\n## New Tasks\r\n\r\nSeveral new tasks have been contributed to the library for this version!\r\n\r\n\r\nNew tasks as of v0.4.5 include:\r\n- Open Arabic LLM Leaderboard tasks, contributed by @shahrzads @Malikeh97 in #2232\r\n- **MMMU (validation set), by @haileyschoelkopf @baberabb @lintangsutawika in #2243**\r\n- TurkishMMLU by @ArdaYueksel in #2283\r\n- PortugueseBench, SpanishBench, GalicianBench, BasqueBench, and CatalanBench aggregate multilingual tasks in #2153 #2154 #2155 #2156 #2157 by @zxcvuser and others\r\n\r\n\r\nAs well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).\r\n\r\n\r\n## Backwards Incompatibilities\r\n\r\n### Finalizing `group` versus `tag` split\r\n\r\nWe've now fully deprecated the use of `group` keys directly within a task's configuration file. The appropriate key to use is now solely `tag` for many cases. See the [v0.4.4 patchnotes](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Freleases\u002Ftag\u002Fv0.4.4) for more info on migration, if you have a set of task YAMLs maintained outside the Eval Harness repository.\r\n\r\n### Handling of Causal vs. Seq2seq backend in HFLM\r\n\r\nIn HFLM, logic specific to handling inputs for Seq2seq (encoder-decoder models like T5) versus Causal (decoder-only autoregressive models, and the vast majority of current LMs) models previously hinged on a check for `self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM`. 
Some users may want to use causal model behavior, but set `self.AUTO_MODEL_CLASS` to a different factory class, such as `transformers.AutoModelForVision2Seq`.\r\n\r\nAs a result, those users who subclass HFLM but do not call `HFLM.__init__()` may now also need to set the `self.backend` attribute to either `\"causal\"` or `\"seq2seq\"` during initialization themselves.\r\n\r\nWhile this should not affect a large majority of users, for those who subclass HFLM in potentially advanced ways, see https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F2353 for the full set of changes.\r\n\r\n### Future Plans\r\n\r\nWe intend to further expand our multimodal support to a wider set of vision-language tasks, as well as a broader set of model types, and are actively seeking user feedback!\r\n\r\nThanks, the LM Eval Harness team (@babera","2024-10-08T21:06:05",{"id":197,"version":198,"summary_zh":199,"released_at":200},106754,"v0.4.4","# lm-eval v0.4.4 Release Notes\r\n\r\n## New Additions\r\n\r\n- This release includes the **Open LLM Leaderboard 2** official task implementations! These can be run by using `--tasks leaderboard`. Thank you to the HF team (@clefourrier, @NathanHB , @KonradSzafer, @lozovskaya) for contributing these -- you can read more about their Open LLM Leaderboard 2 release [here](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopen-llm-leaderboard\u002Fopen_llm_leaderboard). \r\n\r\n\r\n\r\n- **API support is overhauled!** Now: support for *concurrent requests*, chat templates, tokenization, *batching* and improved customization. This makes API support both more generalizable to new providers and should dramatically speed up API model inference.\r\n    - The url can be specified by passing the `base_url` to `--model_args`, for example, `base_url=http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fcompletions`; concurrent requests are controlled with the `num_concurrent` argument; tokenization is controlled with `tokenized_requests`. \r\n    - Other arguments (such as top_p, top_k, etc.) can be passed to the API using `--gen_kwargs` as usual.\r\n    - Note: Instruct-tuned models, not just base models, can be used with `local-completions` using `--apply_chat_template` (either with or without `tokenized_requests`). \r\n        - They can also be used with `local-chat-completions` (for e.g. with a OpenAI Chat API endpoint), but only the former supports loglikelihood tasks (e.g. multiple-choice). **This is because ChatCompletion style APIs generally do not provide access to logits on prompt\u002Finput tokens, preventing easy measurement of multi-token continuations' log probabilities.**\r\n    - example with OpenAI completions API (using vllm serve):\r\n        - `lm_eval --model local-completions --model_args model=meta-llama\u002FMeta-Llama-3.1-8B-Instruct,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface,max_length=4096 --apply_chat_template --batch_size 1 --tasks mmlu`\r\n    - example with chat API:\r\n        - `lm_eval --model local-chat-completions --model_args model=meta-llama\u002FMeta-Llama-3.1-8B-Instruct,num_concurrent=10 --apply_chat_template --tasks gsm8k`\r\n    - We recommend evaluating Llama-3.1-405B models via serving them with vllm then running under `local-completions`!\r\n\r\n- **We've reworked the Task Grouping system to make it clearer when and when not to report an aggregated average score across multiple subtasks**. 
See **#Backwards Incompatibilities** below for more information on changes and migration instructions.\r\n\r\n- A combination of data-parallel and model-parallel (using HF's `device_map` functionality for \"naive\" pipeline parallel) inference using `--model hf` is now supported, thank you to @NathanHB and team!\r\n\r\nOther new additions include a number of miscellaneous bugfixes and much more. Thank you to all contributors who helped out on this release!\r\n\r\n\r\n## New Tasks\r\n\r\nA number of new tasks have been contributed to the library.\r\n\r\nAs a further discoverability improvement, `lm_eval --tasks list` now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! Thank you to @anthony-dipofi for working on this.\r\n\r\nNew tasks as of v0.4.4 include:\r\n- Open LLM Leaderboard 2 tasks--see above!\r\n- Inverse Scaling tasks, contributed by @h-albert-lee in #1589\r\n- Unitxt tasks reworked by @elronbandel in #1933\r\n- MMLU-SR, contributed by @SkySuperCat in #2032\r\n- IrokoBench, contributed by @JessicaOjo @IsraelAbebe in #2042\r\n- MedConceptQA, contributed by @Ofir408 in #2010\r\n- MMLU Pro, contributed by @ysjprojects in #1961\r\n- GSM-Plus, contributed by @ysjprojects in #2103\r\n- Lingoly, contributed by @am-bean in #2198\r\n- GSM8k and Asdiv settings matching the Llama 3.1 evaluation settings, contributed by @Cameron7195 in #2215 #2236\r\n- TMLU, contributed by @adamlin120 in #2093\r\n- Mela, contributed by @Geralt-Targaryen in #1970\r\n\r\n\r\n\r\n\r\n## Backwards Incompatibilities\r\n\r\n### `tag`s versus `group`s, and how to migrate\r\n\r\nPreviously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow for \"parent\" tasks like `mmlu` to aggregate and report a unified score across a set of component \"subtasks\".\r\n\r\n\r\nThere were two ways to add a task to a given `group` name: 1) to provide (a list of) values to the `group` field in a given subtask's config file:\r\n\r\n```yaml\r\n# this is a *task* yaml file.\r\ngroup: group_name1\r\ntask: my_task1\r\n# rest of task config goes here...\r\n```\r\n\r\nor 2) to define a \"group config file\" and specify a group along with its constituent subtasks:\r\n\r\n```yaml\r\n# this is a group's yaml file\r\ngroup: group_name1\r\ntask:\r\n  - subtask_name1\r\n  - subtask_name2\r\n  # ...\r\n```\r\n\r\nThese would both have the same effect of **reporting an averaged metric for group_name1** when calling `lm_eval --tasks group_name1`. However, in use-case 1) (simply registering a shortha","2024-09-05T15:13:13",{"id":202,"version":203,"summary_zh":204,"released_at":205},106755,"v0.4.3","# lm-eval v0.4.3 Release Notes\r\n\r\nWe're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future.\r\n\r\n## New Additions\r\n\r\n\r\nThe big new feature is the often-requested **Chat Templating**, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors!\r\n\r\nYou can now run using a chat template with `--apply_chat_template` and a system prompt of your choosing using `--system_instruction \"my sysprompt here\"`.  
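As a rough illustration (the model and task below are arbitrary placeholders, not taken from the release notes), the two flags combine like this:

```bash
# Evaluate a chat-tuned model with its chat template applied and a custom system prompt.
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks gsm8k \
    --apply_chat_template \
    --system_instruction "You are a helpful assistant."
```
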
The `--fewshot_as_multiturn` flag can control whether each few-shot example in context is a new conversational turn or not.\r\n\r\nThis feature is **currently only supported for model types `hf` and `vllm`** but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs.\r\n\r\n\r\n\r\nThere's a lot more to check out, including:\r\n\r\n- Logging results to the HF Hub if desired using `--hf_hub_log_args`, by @KonradSzafer and team!\r\n\r\n- NeMo model support by @sergiopperez !\r\n- Anthropic Chat API support by @tryuman !\r\n- DeepSparse and SparseML model types by @mgoin !\r\n\r\n- Handling of delta-weights in HF models, by @KonradSzafer !\r\n- LoRA support for VLLM, by @bcicc !\r\n- Fixes to PEFT modules which add new tokens to the embedding layers, by @mapmeld !\r\n\r\n- Fixes to handling of BOS tokens in multiple-choice loglikelihood settings, by @djstrong !\r\n- The use of custom `Sampler` subclasses in tasks, by @LSinev !\r\n- The ability to specify \"hardcoded\" few-shot examples more cleanly, by @clefourrier !\r\n\r\n- Support for Ascend NPUs (`--device npu`) by @statelesshz, @zhabuye, @jiaqiw09 and others!\r\n- Logging of `higher_is_better` in results tables for clearer understanding of eval metrics by @zafstojano !\r\n\r\n- extra info logged about models, including info about tokenizers, chat templating, and more, by @artemorloff @djstrong and others!\r\n\r\n- Miscellaneous bug fixes! And many more great contributions we weren't able to list here.\r\n\r\n## New Tasks\r\n\r\nWe had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval\u002Ftasks\u002FREADME.md`**. Hopefully this will be a useful step to help users to locate the definitions of relevant tasks more easily, by first visiting this page and then locating the appropriate README.md within a given `lm_eval\u002Ftasks` subfolder, for further info on each task contained within a given folder. Thank you to @AnthonyDipofi @Harryalways317 @nairbv @sepiatone and others for working on this and giving feedback! \r\n\r\n\r\nWithout further ado, the tasks:\r\n- ACLUE, a benchmark for Ancient Chinese understanding, by @haonan-li\r\n- BasqueGlue and EusExams, two Basque-language tasks by @juletx\r\n- TMMLU+, an evaluation for Traditional Chinese, contributed by @ZoneTwelve\r\n- XNLIeu, a Basque version of XNLI, by @juletx\r\n- Pile-10K, a perplexity eval taken from a subset of the Pile's validation set, contributed by @mukobi\r\n- FDA, SWDE, and Squad-Completion zero-shot tasks by @simran-arora and team\r\n- Added back the `hendrycks_math` task, the MATH task using the prompt and answer parsing from the original Hendrycks et al. MATH paper rather than Minerva's prompt and parsing\r\n- COPAL-ID, a natively-Indonesian commonsense benchmark, contributed by @Erland366\r\n- tinyBenchmarks variants of the Open LLM Leaderboard 1 tasks, by @LucWeber and team!\r\n- Glianorex, a benchmark for testing performance on fictional medical questions, by @maximegmd\r\n- New FLD (formal logic) task variants by @MorishT\r\n- Improved translations of Lambada Multilingual tasks, added by @zafstojano\r\n- NoticIA, a Spanish summarization dataset by @ikergarcia1996\r\n- The Paloma perplexity benchmark, added by @zafstojano\r\n- We've removed the AMMLU dataset due to concerns about auto-translation quality. 
\r\n- Added the *localized*, not translated, ArabicMMLU dataset, contributed by @Yazeed7 !\r\n- BertaQA, a Basque cultural knowledge benchmark, by @juletx\r\n- New machine-translated ARC-C datasets by @jonabur !\r\n- CommonsenseQA, in a prompt format following Llama, by @murphybrendan\r\n- ...\r\n\r\n## Backwards Incompatibilities\r\n\r\nThe save format for logged results has now changed.\r\n\r\noutput files will now be written to \r\n- `{output_path}\u002F{sanitized_model_name}\u002Fresults_YYYY-MM-DDTHH-MM-SS.xxxxx.json` if `--output_path` is set, and\r\n- `{output_path}\u002F{sanitized_model_name}\u002Fsamples_{task_name}_YYYY-MM-DDTHH-MM-SS.xxxxx.jsonl` for each task's samples if `--log_samples` is set.\r\n\r\ne.g. `outputs\u002Fgpt2\u002Fresults_2024-06-28T00-00-00.00001.json` and `outputs\u002Fgpt2\u002Fsamples_lambada_openai_2024-06-28T00-00-00.00001.jsonl`.\r\n\r\nSee https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1926 for utilities which may help to work with these new filenames.\r\n\r\n## Future Plans\r\n\r\nIn general, we'll be doing our best to keep up with the strong interest and large number of contributions we've seen coming in!\r\n\r\n\r\n- The official **Open LLM Leaderboard 2** t","2024-07-01T14:00:36",{"id":207,"version":208,"summary_zh":209,"released_at":210},106756,"v0.4.2","# lm-eval v0.4.2 Release Notes\r\n\r\nWe are releasing a new minor version of lm-eval for PyPI users! We've been very happy to see continued usage of the lm-evaluation-harness, including as a standard testbench to propel new architecture design (https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.18668), to ease new benchmark creation (https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11548, https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.00786, https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.01469), enabling controlled experimentation on LLM evaluation (https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01781), and more!\r\n\r\n## New Additions\r\n- Request Caching by @inf3rnus - speedups on startup via caching the construction of documents\u002Frequests’ contexts\r\n- Weights and Biases logging by @ayulockin - evals can now be logged to both WandB and Zeno!\r\n- New Tasks\r\n\t- KMMLU, a localized - not (auto) translated! - dataset for testing Korean knowledge by @h-albert-lee @guijinSON\r\n\t- GPQA by @uanu2002\r\n\t- French Bench by @ManuelFay\r\n\t- EQ-Bench by @pbevan1 and @sqrkl\r\n\t- HAERAE-Bench, readded by @h-albert-lee\r\n\t- Updates to answer parsing on many generative tasks  (GSM8k, MGSM, BBH zeroshot) by @thinknbtfly!\r\n\t- Okapi (translated) Open LLM Leaderboard tasks by @uanu2002 and @giux78 \r\n\t- Arabic MMLU and aEXAMS by @khalil-hennara \r\n\t- And more!\r\n- Re-introduction of `TemplateLM` base class for lower-code new LM class implementations by @anjor\r\n- Run the library with metrics\u002Fscoring stage skipped via `--predict_only` by @baberabb\r\n- Many more miscellaneous improvements by a lot of great contributors!\r\n\r\n## Backwards Incompatibilities\r\n\r\nThere were a few breaking changes to lm-eval's general API or logic we'd like to highlight:\r\n### `TaskManager` API \r\n\r\npreviously, users had to call `lm_eval.tasks.initialize_tasks()` to register the library's default tasks, or `lm_eval.tasks.include_path()` to include a custom directory of task YAML configs. 
\r\n\r\nOld usage:\r\n```\r\nimport lm_eval\r\n\r\nlm_eval.tasks.initialize_tasks() \r\n# or:\r\nlm_eval.tasks.include_path(\"\u002Fpath\u002Fto\u002Fmy\u002Fcustom\u002Ftasks\")\r\n\r\n \r\nlm_eval.simple_evaluate(model=lm, tasks=[\"arc_easy\"])\r\n```\r\n\r\nNew intended usage:\r\n```\r\nimport lm_eval\r\n\r\n# optional--only need to instantiate separately if you want to pass custom path!\r\ntask_manager = TaskManager() # pass include_path=\"\u002Fpath\u002Fto\u002Fmy\u002Fcustom\u002Ftasks\" if desired\r\n\r\nlm_eval.simple_evaluate(model=lm, tasks=[\"arc_easy\"], task_manager=task_manager)\r\n```\r\n`get_task_dict()` now also optionally takes a TaskManager object, when wanting to load custom tasks.\r\n\r\nThis should allow for much faster library startup times due to lazily loading requested tasks or groups. \r\n\r\n### Updated Stderr Aggregation\r\n\r\nPrevious versions of the library incorrectly reported erroneously large `stderr` scores for groups of tasks such as MMLU. \r\n\r\nWe've since updated the formula to correctly aggregate Standard Error scores for groups of tasks reporting accuracies aggregated via their mean across the dataset -- see #1390 #1427 for more information. \r\n\r\n\r\n\r\nAs always, please feel free to give us feedback or request new features! We're grateful for the community's support. \r\n\r\n\r\n## What's Changed\r\n* Add support for RWKV models with World tokenizer by @PicoCreator in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1374\r\n* add bypass metric by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1156\r\n* Expand docs, update CITATION.bib by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1227\r\n* Hf: minor egde cases by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1380\r\n* Enable override of printed `n-shot` in table by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1379\r\n* Faster Task and Group Loading, Allow Recursive Groups by @lintangsutawika in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1321\r\n* Fix for https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fissues\u002F1383 by @pminervini in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1384\r\n* fix on --task list by @lintangsutawika in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1387\r\n* Support for Inf2 optimum class [WIP] by @michaelfeil in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1364\r\n* Update README.md by @mycoalchen in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1398\r\n* Fix confusing `write_out.py` instructions in README by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1371\r\n* Use Pooled rather than Combined Variance for calculating stderr of task groupings by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1390\r\n* adding hf_transfer by @michaelfeil in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1400\r\n* `batch_size` with `auto` defaults to 1 if `No executable batch size found` is raised by @pminervini in 
https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1405\r\n* Fix printing bug in #1390 by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1414\r\n* Fixe","2024-03-18T13:07:28",{"id":212,"version":213,"summary_zh":214,"released_at":215},106757,"v0.4.1","## Release Notes\r\n\r\nThis PR release contains all changes so far since the release of v0.4.0 , and is partially a test of our release automation, provided by @anjor .\r\n\r\nAt a high level, some of the changes include: \r\n\r\n- Data-parallel inference using vLLM (contributed by @baberabb )\r\n- A major fix to Huggingface model generation--previously, in v0.4.0, due to a bug with stop sequence handling, generations were sometimes cut off too early.\r\n- Miscellaneous documentation updates\r\n- A number of new tasks, and bugfixes to old tasks!\r\n- The support for OpenAI-like API models using `local-completions` or `local-chat-completions` ( Thanks to @veekaybee @mgoin @anjor and others on this)!\r\n- Integration with tools for visualization of results, such as with Zeno, and WandB coming soon!\r\n\r\nMore frequent (minor) version releases may be done in the future, to make it easier for PyPI users!\r\n\r\nWe're very pleased by the uptick in interest in LM Evaluation Harness recently, and we hope to continue to improve the library as time goes on. We're grateful to everyone who's contributed, and are excited by how many new contributors this version brings! If you have feedback for us, or would like to help out developing the library, please let us know.\r\n\r\nIn the next version release, we hope to include\r\n- Chat Templating + System Prompt support, for locally-run models\r\n- Improved Answer Extraction for many generative tasks, making them more easily run zero-shot and less dependent on model output formatting\r\n- General speedups and QoL fixes to the non-inference portions of LM-Evaluation-Harness, including drastically reduced startup times \u002F faster non-inference processing steps especially when num_fewshot is large!\r\n- A new `TaskManager` object and the deprecation of `lm_eval.tasks.initialize_tasks()`, for achieving the easier registration of many tasks and configuration of new groups of tasks\r\n\r\n## What's Changed\r\n* Announce v0.4.0 in README by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1061\r\n* remove commented planned samplers in `lm_eval\u002Fapi\u002Fsamplers.py` by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1062\r\n* Confirming links in docs work (WIP) by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1065\r\n* Set actual version to v0.4.0 by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1064\r\n* Updating docs hyperlinks  by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1066\r\n* Fiddling with READMEs, Reenable CI tests on `main` by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1063\r\n* Update _cot_fewshot_template_yaml by @lintangsutawika in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1074\r\n* Patch scrolls by @lintangsutawika in 
https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1077\r\n* Update template of qqp dataset by @shiweijiezero in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1097\r\n* Change the sub-task name from sst to sst2 in glue by @shiweijiezero in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1099\r\n* Add kmmlu evaluation to tasks by @h-albert-lee in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1089\r\n* Fix stderr by @lintangsutawika in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1106\r\n* Simplified `evaluator.py` by @lintangsutawika in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1104\r\n* [Refactor] vllm data parallel by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1035\r\n* Unpack group in `write_out` by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1113\r\n* Revert \"Simplified `evaluator.py`\" by @lintangsutawika in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1116\r\n* `qqp`, `mnli_mismatch`: remove unlabled test sets by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1114\r\n* fix: bug of BBH_cot_fewshot by @Momo-Tori in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1118\r\n* Bump BBH version by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1120\r\n* Refactor `hf` modeling code by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1096\r\n* Additional process for doc_to_choice by @lintangsutawika in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1093\r\n* doc_to_decontamination_query can use function by @lintangsutawika in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1082\r\n* Fix vllm `batch_size` type by @xTayEx in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1128\r\n* fix: passing max_length to vllm engine args by @NanoCode012 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1124\r\n* Fix Loading Local Dataset by @lintangsutawika in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F1127\r\n* place model onto `mps` by @baberabb in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002F","2024-01-31T15:29:14",{"id":217,"version":218,"summary_zh":219,"released_at":220},106758,"v0.4.0","## What's Changed\r\n* Replace stale `triviaqa` dataset link by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F364\r\n* Update `actions\u002Fsetup-python`in  CI workflows by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F365\r\n* Bump `triviaqa` version by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F366\r\n* Update `lambada_openai` multilingual data source by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F370\r\n* Update Pile Test\u002FVal Download URLs by @fattorib in 
https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F373\r\n* Added ToxiGen task by @Thartvigsen in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F377\r\n* Added CrowSPairs by @aflah02 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F379\r\n* Add accuracy metric to crows-pairs by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F380\r\n* hotfix(gpt2): Remove vocab-size logits slice by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F384\r\n* Enable \"low_cpu_mem_usage\" to reduce the memory usage of HF models by @sxjscience in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F390\r\n* Upstream `hf-causal` and `hf-seq2seq` model implementations by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F381\r\n* Hosting arithmetic dataset on HuggingFace by @fattorib in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F391\r\n* Hosting wikitext on HuggingFace  by @fattorib in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F396\r\n* Change device parameter to cuda:0 to avoid runtime error by @Jeffwan in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F403\r\n* Update README installation instructions by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F407\r\n* feat: evaluation using peft models with CLM by @zanussbaum in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F414\r\n* Update setup.py dependencies by @ret2libc in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F416\r\n* fix: add seq2seq peft by @zanussbaum in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F418\r\n* Add support for load_in_8bit and trust_remote_code model params by @philwee in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F422\r\n* Hotfix: patch issues with the `huggingface.py` model classes by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F427\r\n* Continuing work on refactor [WIP] by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F425\r\n* Document task name wildcard support in README by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F435\r\n* Add non-programmatic BIG-bench-hard tasks by @yurodiviy in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F406\r\n* Updated handling for device in lm_eval\u002Fmodels\u002Fgpt2.py by @nikhilpinnaparaju in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F447\r\n* [WIP, Refactor] Staging more changes by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F465\r\n* [Refactor, WIP] Multiple Choice + loglikelihood_rolling support for YAML tasks by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F467\r\n* Configurable-Tasks by @lintangsutawika in 
https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F438\r\n* single GPU automatic batching logic by @fattorib in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F394\r\n* Fix bugs introduced in #394 #406 and max length bug by @juletx in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F472\r\n* Sort task names to keep the same order always by @juletx in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F474\r\n* Set PAD token to EOS token by @nikhilpinnaparaju in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F448\r\n* [Refactor] Add decorator for registering YAMLs as tasks by @haileyschoelkopf in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F486\r\n* fix adaptive batch crash when there are no new requests by @jquesnelle in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F490\r\n* Add multilingual datasets (XCOPA, XStoryCloze, XWinograd, PAWS-X, XNLI, MGSM) by @juletx in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F426\r\n* Create output path directory if necessary by @janEbert in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F483\r\n* Add results of various models in json and md format by @juletx in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F477\r\n* Update config by @lintangsutawika in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F501\r\n* P3 prompt task by @lintangsutawika in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F493\r\n* Evaluation Against Portion of Benchmark Data by @kenhktsui in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F480\r\n* Add option to dump prompts and completions to a JSON file by @juletx in https:\u002F\u002Fgithub.com\u002FEle","2023-12-04T15:08:53",{"id":222,"version":223,"summary_zh":224,"released_at":225},106759,"v0.3.0","## HuggingFace Datasets Integration\r\nThis release integrates HuggingFace `datasets` as the core dataset management interface, removing previous custom downloaders.\r\n\r\n## What's Changed\r\n* Refactor `Task` downloading to use `HuggingFace.datasets` by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F300\r\n* Add templates and update docs by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F308\r\n* Add dataset features to `TriviaQA` by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F305\r\n* Add `SWAG` by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F306\r\n* Fixes for using lm_eval as a library by @dirkgr in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F309\r\n* Researcher2 by @researcher2 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F261\r\n* Suggested updates for the task guide by @StephenHogg in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F301\r\n* Add pre-commit by @Mistobaan in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F317\r\n* Decontam import fix by @jon-tow 
in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F321\r\n* Add bootstrap_iters kwarg by @Muennighoff in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F322\r\n* Update decontamination.md by @researcher2 in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F331\r\n* Fix key access in squad evaluation metrics by @konstantinschulz in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F333\r\n* Fix make_disjoint_window for tail case by @richhankins in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F336\r\n* Manually concat tokenizer revision with subfolder by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F343\r\n* [deps] Use minimum versioning for `numexpr` by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F352\r\n* Remove custom datasets that are in HF by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F330\r\n* Add `TextSynth` API by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F299\r\n* Add the original `LAMBADA` dataset by @jon-tow in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F357\r\n\r\n## New Contributors\r\n* @dirkgr made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F309\r\n* @Mistobaan made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F317\r\n* @konstantinschulz made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F333\r\n* @richhankins made their first contribution in https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fpull\u002F336\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fcompare\u002Fv0.2.0...v0.3.0","2022-12-08T08:34:37",{"id":227,"version":228,"summary_zh":229,"released_at":230},106760,"v0.2.0","Major changes since 0.1.0:\r\n\r\n- added blimp (#237)\r\n- added qasper (#264)\r\n- added asdiv (#244)\r\n- added truthfulqa (#219)\r\n- added gsm (#260)\r\n- implemented description dict and deprecated provide_description (#226)\r\n- new `--check_integrity` flag to run integrity unit tests at eval time (#290)\r\n- positional arguments to `evaluate` and `simple_evaluate` are now deprecated\r\n- `_CITATION` attribute on task modules (#292)\r\n- lots of bug fixes and task fixes (always remember to report task versions for comparability!)","2022-03-07T02:12:23",{"id":232,"version":233,"summary_zh":79,"released_at":234},106761,"v0.0.1","2021-09-02T02:28:08"]