[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-mlfoundations--evalchemy":3,"tool-mlfoundations--evalchemy":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":113,"forks":114,"last_commit_at":115,"license":78,"difficulty_score":10,"env_os":116,"env_gpu":117,"env_ram":118,"env_deps":119,"category_tags":132,"github_topics":78,"view_count":23,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":133,"updated_at":134,"faqs":135,"releases":171},2236,"mlfoundations\u002Fevalchemy","evalchemy","Automatic evals for LLMs","Evalchemy 是一款专为评估后训练大语言模型（LLM）打造的统一工具包，由 DataComp 社区与 Bespoke Labs 联合开发。它旨在解决大模型评估中环境配置复杂、依赖冲突频发以及多基准测试难以统一管理的痛点，让研究人员和开发者能够更专注于模型性能分析而非繁琐的工程搭建。\n\n无论是需要验证推理能力的学术研究者，还是致力于优化生产级模型的工程师，Evalchemy 都能提供极大的便利。其核心亮点在于“一键式”安装体验，彻底消除了传统评估流程中的依赖地狱；同时支持数据并行与模型并行，既能利用多 GPU 加速评估，也能轻松承载超大参数模型。此外，Evalchemy 拥有广泛的兼容性，不仅内置了 AIME、MATH500 等最新推理基准，还原生支持 vLLM 高速推理引擎及通过 Curator 调用的各类 API 模型（如 OpenAI、Gemini 等）。配合标准化的结果追踪与排行榜提交功能，Evalchemy 让大模型评估变得高效、规范且易于复现。","# 🧪 Evalchemy\n\n> A unified and easy-to-use toolkit for evaluating post-trained language models\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmlfoundations_evalchemy_readme_c67227dbbf6a.png)\n\nEvalchemy is developed by the [DataComp 
community](https:\u002F\u002Fdatacomp.ai) and [Bespoke Labs](https:\u002F\u002Fbespokelabs.ai) and builds on the [LM-Eval-Harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness).\n\n\n## 🎉 What's New \n\n#### [2025.02.24] New Reasoning Benchmarks\n\n- AIME25 and Alice in Wonderland have been added to [available benchmarks](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fevalchemy?tab=readme-ov-file#built-in-benchmarks).\n\n#### [2025.01.30] API Model Support\n\n- [API models via Curator](https:\u002F\u002Fgithub.com\u002Fbespokelabsai\u002Fcurator\u002F): With `--model curator` you can now evaluate even more API-based models via [Curator](https:\u002F\u002Fgithub.com\u002Fbespokelabsai\u002Fcurator\u002F), including all those supported by [LiteLLM](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002Fproviders).\n\n```\n  python -m eval.eval \\\n        --model curator  \\\n        --tasks AIME24,MATH500,GPQADiamond \\\n        --model_name \"gemini\u002Fgemini-2.0-flash-thinking-exp-01-21\" \\\n        --apply_chat_template False \\\n        --model_args 'tokenized_requests=False' \\\n        --output_path logs\n```\n\n#### [2025.01.29] New Reasoning Benchmarks\n\n- AIME24, AMC23, MATH500, LiveCodeBench, GPQADiamond, HumanEvalPlus, MBPPPlus, BigCodeBench, MultiPL-E, and CRUXEval have been added to our growing list of [available benchmarks](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fevalchemy?tab=readme-ov-file#built-in-benchmarks). This is part of the effort in the [Open Thoughts](https:\u002F\u002Fgithub.com\u002Fopen-thoughts\u002Fopen-thoughts) project. See [our blog post](https:\u002F\u002Fwww.open-thoughts.ai\u002Fblog\u002Fmeasure) on using Evalchemy for measuring reasoning models. 
\n\n#### [2025.01.28] New Model Support\n- [vLLM models](https:\u002F\u002Fblog.vllm.ai\u002F2023\u002F06\u002F20\u002Fvllm.html): High-performance inference and serving engine with PagedAttention technology\n```bash\npython -m eval.eval \\\n    --model vllm \\\n    --tasks alpaca_eval \\\n    --model_args \"pretrained=meta-llama\u002FMeta-Llama-3-8B-Instruct\" \\\n    --batch_size 16 \\\n    --output_path logs\n```\n- [OpenAI models](https:\u002F\u002Fopenai.com\u002F): Full support for OpenAI's model lineup\n```bash\npython -m eval.eval \\\n    --model openai-chat-completions \\\n    --tasks alpaca_eval \\\n    --model_args \"model=gpt-4o-mini-2024-07-18,num_concurrent=32\" \\\n    --batch_size 16 \\\n    --output_path logs \n```\n\n### Key Features\n\n- **Unified Installation**: One-step setup for all benchmarks, eliminating dependency conflicts\n- **Parallel Evaluation**:\n  - Data-Parallel: Distribute evaluations across multiple GPUs for faster results\n  - Model-Parallel: Handle large models that don't fit on a single GPU\n- **Simplified Usage**: Run any benchmark with a consistent command-line interface\n- **Results Management**: \n  - Local results tracking with standardized output format\n  - Optional database integration for systematic tracking\n  - Leaderboard submission capability (requires database setup)\n\n## ⚡ Quick Start\n\n### Installation\n\nWe suggest using conda ([installation instructions](https:\u002F\u002Fdocs.anaconda.com\u002Fminiconda\u002Finstall\u002F#quick-command-line-install)). 
\n\n```bash\n# Create and activate conda environment\nconda create --name evalchemy python=3.10\nconda activate evalchemy\n\n# Clone the repo\ngit clone git@github.com:mlfoundations\u002Fevalchemy.git   \ncd evalchemy\n\n# Install dependencies\npip install -e .\npip install -e eval\u002Fchat_benchmarks\u002Falpaca_eval\n\n# Note: On some HPC systems you may need to modify pyproject.toml \n# to use absolute paths for the fschat dependency:\n# Change: \"fschat @ file:eval\u002Fchat_benchmarks\u002FMTBench\"\n# To:     \"fschat @ file:\u002F\u002F\u002Fabsolute\u002Fpath\u002Fto\u002Fevalchemy\u002Feval\u002Fchat_benchmarks\u002FMTBench\"\n# Or remove entirely and separately run\n# pip install -e eval\u002Fchat_benchmarks\u002FMTBench \n\n# Log into HuggingFace for datasets and models.\nhuggingface-cli login\n```\n\n## 📚 Available Tasks\n\n### Built-in Benchmarks\n- All tasks from [LM Evaluation Harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness)\n- Custom instruction-based tasks (found in [`eval\u002Fchat_benchmarks\u002F`](eval\u002Fchat_benchmarks\u002F)):\n  - **MTBench**: [Multi-turn dialogue evaluation benchmark](https:\u002F\u002Fgithub.com\u002Fmtbench101\u002Fmt-bench-101)\n  - **WildBench**: [Real-world task evaluation](https:\u002F\u002Fgithub.com\u002Fallenai\u002FWildBench)\n  - **RepoBench**: [Code understanding and repository-level tasks](https:\u002F\u002Fgithub.com\u002FLeolty\u002Frepobench)\n  - **MixEval**: [Comprehensive evaluation across domains](https:\u002F\u002Fgithub.com\u002FPsycoy\u002FMixEval)\n  - **IFEval**: [Instruction following capability evaluation](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002Ftree\u002Fmaster\u002Finstruction_following_eval)\n  - **AlpacaEval**: [Instruction following evaluation](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval)\n  - **HumanEval**: [Code generation and problem solving](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fhuman-eval)\n 
 - **HumanEvalPlus**: [HumanEval with more test cases](https:\u002F\u002Fgithub.com\u002Fevalplus\u002Fevalplus)\n  - **ZeroEval**: [Logical reasoning and problem solving](https:\u002F\u002Fgithub.com\u002FWildEval\u002FZeroEval)\n  - **MBPP**: [Python programming benchmark](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002Ftree\u002Fmaster\u002Fmbpp)\n  - **MBPPPlus**: [MBPP with more test cases](https:\u002F\u002Fgithub.com\u002Fevalplus\u002Fevalplus)\n  - **BigCodeBench:** [Benchmarking Code Generation with Diverse Function Calls and Complex Instructions](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.15877)\n\n    > **🚨 Warning:** for BigCodeBench evaluation, we strongly recommend using a Docker container since the execution of LLM generated code on a machine can lead to destructive outcomes. More info is [here](eval\u002Fchat_benchmarks\u002FBigCodeBench\u002FREADME.md).\n  - **MultiPL-E:** [Multi-Programming Language Evaluation of Large Language Models of Code](https:\u002F\u002Fgithub.com\u002Fnuprl\u002FMultiPL-E\u002F)\n  - **CRUXEval:** [Code Reasoning, Understanding, and Execution Evaluation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.03065)\n  - **AIME24**: [Math Reasoning Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdi-zhang-fdu\u002FAIME_1983_2024)\n  - **AIME25**: [Math Reasoning Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FTIGER-Lab\u002FAIME25)\n  - **AMC23**: [Math Reasoning Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAI-MO\u002Faimo-validation-amc)\n  - **MATH500**: [Math Reasoning Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceH4\u002FMATH-500) split from [Let's Verify Step by Step](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fprm800k\u002Ftree\u002Fmain?tab=readme-ov-file#math-splits)\n  - **LiveCodeBench**: [Benchmark of LLMs for code](https:\u002F\u002Flivecodebench.github.io\u002F)\n  - **LiveBench**: [A benchmark for LLMs designed with 
test set contamination and objective evaluation in mind](https:\u002F\u002Flivebench.ai\u002F#\u002F)\n  - **GPQA Diamond**: [A Graduate-Level Google-Proof Q&A Benchmark](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FIdavidrein\u002Fgpqa)\n  - **Alice in Wonderland**: [Simple Tasks Showing Complete Reasoning Breakdown in LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.02061)\n  - **Arena-Hard-Auto** (Coming soon): [Automatic evaluation tool for instruction-tuned LLMs](https:\u002F\u002Fgithub.com\u002Flmarena\u002Farena-hard-auto)\n  - **SWE-Bench** (Coming soon): [Evaluating large language models on real-world software issues](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSWE-bench)\n  - **SafetyBench** (Coming soon): [Evaluating the safety of LLMs](https:\u002F\u002Fgithub.com\u002Fthu-coai\u002FSafetyBench)\n  - **SciCode Bench** (Coming soon): [Evaluate language models in generating code for solving realistic scientific research problems](https:\u002F\u002Fgithub.com\u002Fscicode-bench\u002FSciCode)\n  - **Berkeley Function Calling Leaderboard** (Coming soon): [Evaluating ability of LLMs to use APIs](https:\u002F\u002Fgorilla.cs.berkeley.edu\u002Fblogs\u002F13_bfcl_v3_multi_turn.html)\n  \n\nWe have recorded reproduced results against published numbers for these benchmarks in [`reproduced_benchmarks.md`](reproduced_benchmarks.md).\n\n\n### Basic Usage\n\nMake sure your `OPENAI_API_KEY` is set in your environment before running evaluations, if an LLM judge is required. \n\n```bash\npython -m eval.eval \\\n    --model hf \\\n    --tasks HumanEval,mmlu \\\n    --model_args \"pretrained=mistralai\u002FMistral-7B-Instruct-v0.3\" \\\n    --batch_size 2 \\\n    --output_path logs\n```\n\nThe results will be written out in `output_path`. If you have `jq` [installed](https:\u002F\u002Fjqlang.github.io\u002Fjq\u002Fdownload\u002F), you can view the results easily after evaluation. 
Example: `jq '.results' logs\u002FQwen__Qwen2.5-7B-Instruct\u002Fresults_2024-11-17T17-12-28.668908.json`\n\n**Args**: \n\n- `--model`: Which model type or provider is evaluated (example: hf)\n- `--tasks`: Comma-separated list of tasks to be evaluated.\n- `--model_args`: Model path and parameters. Comma-separated list of parameters passed to the model constructor. Accepts a string of the format `\"arg1=val1,arg2=val2,...\"`. You can find the list of supported arguments [here](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002F365fcda9b85bbb6e0572d91976b8daf409164500\u002Flm_eval\u002Fmodels\u002Fhuggingface.py#L66).\n- `--batch_size`: Batch size for inference\n- `--output_path`: Directory to save evaluation results\n\nExample running multiple benchmarks:\n```bash\npython -m eval.eval \\\n    --model hf \\\n    --tasks MTBench,WildBench,alpaca_eval \\\n    --model_args \"pretrained=mistralai\u002FMistral-7B-Instruct-v0.3\" \\\n    --batch_size 2 \\\n    --output_path logs\n```\n\n**Config shortcuts**: \n\nTo reuse common settings without supplying the full arguments every time, we support reading eval configs from YAML files. These configs replace the `--batch_size`, `--tasks`, and `--annotator_model` arguments. Some example config files can be found in `.\u002Fconfigs`. Pass a config with the `--config` flag as shown below:\n\n```bash\npython -m eval.eval \\\n    --model hf \\\n    --model_args \"pretrained=mistralai\u002FMistral-7B-Instruct-v0.3\" \\\n    --output_path logs \\\n    --config configs\u002Flight_gpt4omini0718.yaml\n```\n\nMore command examples are available in [`eval\u002Fexamples`](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002FEvalchemy\u002Ftree\u002Fmain\u002Feval\u002Fexamples) to help you get started with Evalchemy. 
\n\n## 🔧 Advanced Usage\n\n### Support for different models\n\nThrough LM-Eval-Harness, we support all HuggingFace models and are currently adding support for all LM-Eval-Harness models, such as OpenAI and vLLM. For more information on such models, please check out the [models page](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Ftree\u002Fmain\u002Flm_eval\u002Fmodels).\n\nTo choose a model, set 'pretrained=\u003Cname of hf model>' in `--model_args`, where the model name can be either a HuggingFace model name or a path to a local model. \n\n\n### HPC Distributed Evaluation\n\nFor even faster evaluation, use full data parallelism and launch a vLLM process for each GPU. \n\nWe have also made this easy to do at scale across multiple nodes on HPC (High-Performance Computing) clusters:\n\n```bash\npython eval\u002Fdistributed\u002Flaunch.py --model_name \u003Cmodel_id> --tasks \u003Ctask_list> --num_shards \u003Cn> --watchdog\n```\n\nKey features:\n- Run evaluations in parallel across multiple compute nodes\n- Dramatically reduce wall clock time for large benchmarks\n- Offline mode support for environments without internet access on GPU nodes\n- Automatic cluster detection and configuration\n- Efficient result collection and scoring\n\nRefer to the [distributed README](eval\u002Fdistributed\u002FREADME.md) for more details. \n\nNOTE: This is configured for specific HPC clusters, but can easily be adapted. It can also be adapted for a non-HPC setup by using `CUDA_VISIBLE_DEVICES` instead of SLURM job arrays. 
\n\n\n### Multi-GPU Evaluation \n\nNOTE: this is slower than doing fully data parallel evaluation (see previous section)\n\n```bash\naccelerate launch --num-processes \u003Cnum-gpus> --num-machines \u003Cnum-nodes> \\\n    --multi-gpu -m eval.eval \\\n    --model hf \\\n    --tasks MTBench,alpaca_eval \\\n    --model_args 'pretrained=mistralai\u002FMistral-7B-Instruct-v0.3' \\\n    --batch_size 2 \\\n    --output_path logs\n```\n\n### Large Model Evaluation\n\nFor models that don't fit on a single GPU, use model parallelism:\n\n```bash\npython -m eval.eval \\\n    --model hf \\\n    --tasks MTBench,alpaca_eval \\\n    --model_args 'pretrained=mistralai\u002FMistral-7B-Instruct-v0.3,parallelize=True' \\\n    --batch_size 2 \\\n    --output_path logs\n```\n\n> **💡 Note**: While \"auto\" batch size is supported, we recommend manually tuning the batch size for optimal performance. The optimal batch size depends on the model size, GPU memory, and the specific benchmark. We used a maximum of 32 and a minimum of 4 (for RepoBench) to evaluate Llama-3-8B-Instruct on 8xH100 GPUs.\n\n### Output Log Structure\n\nOur generated logs include critical information about each evaluation to help inform your experiments. We highlight important items in our generated logs. 
\n\n- Model Configuration\n  - `model`: Model framework used\n  - `model_args`: Model arguments for the model framework\n  - `batch_size`: Size of processing batches\n  - `device`: Computing device specification\n  - `annotator_model`: Model used for annotation (\"gpt-4o-mini-2024-07-18\")\n- Seed Configuration\n  - `random_seed`: General random seed\n  - `numpy_seed`: NumPy-specific seed\n  - `torch_seed`: PyTorch-specific seed\n  - `fewshot_seed`: Seed for few-shot examples\n- Model Details\n  - `model_num_parameters`: Number of model parameters\n  - `model_dtype`: Model data type\n  - `model_revision`: Model version\n  - `model_sha`: Model commit hash\n\n- Version Control\n  - `git_hash`: Repository commit hash\n  - `date`: Unix timestamp of evaluation\n  - `transformers_version`: Hugging Face Transformers version\n- Tokenizer Configuration\n  - `tokenizer_pad_token`: Padding token details\n  - `tokenizer_eos_token`: End of sequence token\n  - `tokenizer_bos_token`: Beginning of sequence token\n  - `eot_token_id`: End of text token ID\n  - `max_length`: Maximum sequence length\n- Model Settings\n  - `model_source`: Model source platform\n  - `model_name`: Full model identifier\n  - `model_name_sanitized`: Sanitized model name for file system usage\n  - `chat_template`: Conversation template\n  - `chat_template_sha`: Template hash\n- Timing Information\n  - `start_time`: Evaluation start timestamp\n  - `end_time`: Evaluation end timestamp\n  - `total_evaluation_time_seconds`: Total duration\n- Hardware Environment\n  - PyTorch version and build configuration\n  - Operating system details\n  - GPU configuration\n  - CPU specifications\n  - CUDA and driver versions\n  - Relevant library versions\n\n### Customizing Evaluation\n\n#### 🤖 Change Annotator Model\n\nAs part of Evalchemy, we want to make swapping in different Language Model Judges for standard benchmarks easy. Currently, we support two judge settings. 
The first is the default setting, where we use a benchmark's default judge. To activate this, you can either do nothing or pass in\n```bash\n--annotator_model auto\n```\nIn addition to the default assignments, we support using gpt-4o-mini-2024-07-18 as a judge:\n\n```bash\n--annotator_model gpt-4o-mini-2024-07-18\n```\n\nWe are planning on adding support for different judges in the future!\n\n### ⏱️ Runtime and Cost Analysis\n\nEvalchemy makes running common benchmarks simple, fast, and versatile! We list the speeds and costs for each benchmark we achieve with Evalchemy for Meta-Llama-3-8B-Instruct on 8xH100 GPUs.\n\n| Benchmark | Runtime (8xH100) | Batch Size | Total Tokens | Default Judge Cost ($) | GPT-4o-mini Judge Cost ($) | Notes |\n|-----------|------------------|------------|--------------|----------------|-------------------|--------|\n| MTBench | 14:00 | 32 | ~196K | 6.40 | 0.05 | |\n| WildBench | 38:00 | 32 | ~2.2M | 30.00 | 0.43 | |\n| RepoBench | 46:00 | 4 | ~23K | - | - | Lower batch size due to memory |\n| MixEval | 13:00 | 32 | ~4-6M | 3.36 | 0.76 | Varies by judge model |\n| AlpacaEval | 16:00 | 32 | ~936K | 9.40 | 0.14 | |\n| HumanEval | 4:00 | 32 | ~300 | - | - | No API costs |\n| IFEval | 1:30 | 32 | ~550 | - | - | No API costs |\n| ZeroEval | 1:44:00 | 32 | ~8K | - | - | Longest runtime |\n| MBPP | 6:00 | 32 | 500 | - | - | No API costs |\n| MMLU | 7:00 | 32 | 500 | - | - | No API costs |\n| ARC | 4:00 | 32 | - | - | - | No API costs |\n| DROP | 20:00 | 32 | - | - | - | No API costs |\n\n**Notes:**\n- Runtimes measured using 8x H100 GPUs with Meta-Llama-3-8B-Instruct model\n- Batch sizes optimized for memory and speed\n- API costs vary based on judge model choice\n\n**Cost-Saving Tips:**\n- Use gpt-4o-mini-2024-07-18 judge when possible for significant cost savings\n- Adjust batch size based on available memory\n- Consider using data-parallel evaluation for faster results\n\n### 🔐 Special Access Requirements\n\n#### ZeroEval Access\nTo run 
ZeroEval benchmarks, you need to:\n\n1. Request access to the [ZebraLogicBench-private dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002FZebraLogicBench-private) on Hugging Face\n2. Accept the terms and conditions\n3. Log in to your Hugging Face account when running evaluations\n\n## 🛠️ Implementing Custom Evaluations\n\nTo add a new evaluation system:\n\n1. Create a new directory under `eval\u002Fchat_benchmarks\u002F`\n2. Implement `eval_instruct.py` with two required functions:\n   - `eval_instruct(model)`: Takes an LM Eval Model, returns results dict\n   - `evaluate(results)`: Takes results dictionary, returns evaluation metrics\n\n### Adding External Evaluation Repositories\n\nUse git subtree to manage external evaluation code:\n\n```bash\n# Add external repository\ngit subtree add --prefix=eval\u002Fchat_benchmarks\u002Fnew_eval https:\u002F\u002Fgithub.com\u002Foriginal\u002Frepo.git main --squash\n\n# Pull updates\ngit subtree pull --prefix=eval\u002Fchat_benchmarks\u002Fnew_eval https:\u002F\u002Fgithub.com\u002Foriginal\u002Frepo.git main --squash\n\n# Push contributions back\ngit subtree push --prefix=eval\u002Fchat_benchmarks\u002Fnew_eval https:\u002F\u002Fgithub.com\u002Foriginal\u002Frepo.git contribution-branch\n```\n\n### 🔍 Debug Mode\n\nTo run evaluations in debug mode, add the `--debug` flag:\n\n```bash\npython -m eval.eval \\\n    --model hf \\\n    --tasks MTBench \\\n    --model_args \"pretrained=mistralai\u002FMistral-7B-Instruct-v0.3\" \\\n    --batch_size 2 \\\n    --output_path logs \\\n    --debug\n```\n\nThis is particularly useful when testing new evaluation implementations, debugging model configurations, verifying dataset access, and testing database connectivity.\n\n### 🚀 Performance Tips\n\n1. 
Utilize batch processing for faster evaluation:\n```python\nall_instances.append(\n    Instance(\n        \"generate_until\",\n        example,\n        (\n            inputs,\n            {\n                \"max_new_tokens\": 1024,\n                \"do_sample\": False,\n            },\n        ),\n        idx,\n    )\n)\n\noutputs = self.compute(model, all_instances)\n```\n\n2. Use the LM-eval logger for consistent logging across evaluations\n\n### 🔧 Troubleshooting\nEvalchemy has been tested on CUDA 12.4. If you run into issues like this: `undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12`, try updating your CUDA version:\n```bash\nwget https:\u002F\u002Fdeveloper.download.nvidia.com\u002Fcompute\u002Fcuda\u002Frepos\u002Fdebian11\u002Fx86_64\u002Fcuda-keyring_1.1-1_all.deb\nsudo dpkg -i cuda-keyring_1.1-1_all.deb\nsudo add-apt-repository contrib\nsudo apt-get update\nsudo apt-get -y install cuda-toolkit-12-4\n```\n\n## 🏆 Leaderboard Integration\nTo track experiments and evaluations, we support logging results to a PostgreSQL database. 
Details on the entry schemas and database setup can be found in [`database\u002F`](database\u002F).\n\n\n## Contributing\nThank you to all the contributors for making this project possible!\nPlease follow [these instructions](CONTRIBUTING.md) on how to contribute.\n\n## Citation\nIf you find Evalchemy useful, please consider citing us!\n\n```\n@software{Evalchemy,\n  author = {Raoof, Negin and Guha, Etash Kumar and Marten, Ryan and Mercat, Jean and Frankel, Eric and Keh, Sedrick and Bansal, Hritik and Smyrnis, Georgios and Nezhurina, Marianna and Vu, Trung and Sprague, Zayne Rea and Merrill, Mike A and Chen, Liangyu and Choi, Caroline and Khan, Zaid and Grover, Sachin and Feuer, Benjamin and Suvarna, Ashima and Su, Shiye and Zhao, Wanjia and Sharma, Kartik and Ji, Charlie Cheng-Jie and Arora, Kushal and Li, Jeffrey and Gokaslan, Aaron and Pratt, Sarah M and Muennighoff, Niklas and Saad-Falcon, Jon and Yang, John and Aali, Asad and Pimpalgaonkar, Shreyas and Albalak, Alon and Dave, Achal and Pouransari, Hadi and Durrett, Greg and Oh, Sewoong and Hashimoto, Tatsunori and Shankar, Vaishaal and Choi, Yejin and Bansal, Mohit and Hegde, Chinmay and Heckel, Reinhard and Jitsev, Jenia and Sathiamoorthy, Maheswaran and Dimakis, Alex and Schmidt, Ludwig},\n  month = {June},\n  title = {{Evalchemy}},\n  year = {2025}\n}\n```\n","# 🧪 Evalchemy\n\n> 一个统一且易于使用的工具包，用于评估后训练的语言模型\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmlfoundations_evalchemy_readme_c67227dbbf6a.png)\n\nEvalchemy 由 [DataComp 社区](https:\u002F\u002Fdatacomp.ai) 和 [Bespoke Labs](https:\u002F\u002Fbespokelabs.ai) 开发，并基于 [LM-Eval-Harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) 构建。\n\n\n## 🎉 最新动态 \n\n#### [2025.02.24] 新的推理基准测试\n\n- AIME25 和《爱丽丝梦游仙境》已被添加到[可用的基准测试](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fevalchemy?tab=readme-ov-file#built-in-benchmarks)中。\n\n#### [2025.01.30] API 模型支持\n\n- [通过 Curator 的 API 
模型](https:\u002F\u002Fgithub.com\u002Fbespokelabsai\u002Fcurator\u002F)：使用 `--model curator` 参数，现在可以通过 [Curator](https:\u002F\u002Fgithub.com\u002Fbespokelabsai\u002Fcurator\u002F) 评估更多基于 API 的模型，包括所有由 [LiteLLM](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002Fproviders) 支持的模型。\n\n```\n  python -m eval.eval \\\n        --model curator  \\\n        --tasks AIME24,MATH500,GPQADiamond \\\n        --model_name \"gemini\u002Fgemini-2.0-flash-thinking-exp-01-21\" \\\n        --apply_chat_template False \\\n        --model_args 'tokenized_requests=False' \\\n        --output_path logs\n```\n\n#### [2025.01.29] 新的推理基准测试\n\n- AIME24、AMC23、MATH500、LiveCodeBench、GPQADiamond、HumanEvalPlus、MBPPPlus、BigCodeBench、MultiPL-E 和 CRUXEval 已被添加到我们不断增长的[可用基准测试列表](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fevalchemy?tab=readme-ov-file#built-in-benchmarks)中。这是 [Open Thoughts](https:\u002F\u002Fgithub.com\u002Fopen-thoughts\u002Fopen-thoughts) 项目的一部分。请参阅我们的[博客文章](https:\u002F\u002Fwww.open-thoughts.ai\u002Fblog\u002Fmeasure)，了解如何使用 Evalchemy 来评估推理模型。\n\n#### [2025.01.28] 新的模型支持\n- [vLLM 模型](https:\u002F\u002Fblog.vllm.ai\u002F2023\u002F06\u002F20\u002Fvllm.html)：采用 PagedAttention 技术的高性能推理和推理服务引擎\n```bash\npython -m eval.eval \\\n    --model vllm \\\n    --tasks alpaca_eval \\\n    --model_args \"pretrained=meta-llama\u002FMeta-Llama-3-8B-Instruct\" \\\n    --batch_size 16 \\\n    --output_path logs\n```\n- [OpenAI 模型](https:\u002F\u002Fopenai.com\u002F)：全面支持 OpenAI 的模型系列\n```bash\npython -m eval.eval \\\n    --model openai-chat-completions \\\n    --tasks alpaca_eval \\\n    --model_args \"model=gpt-4o-mini-2024-07-18,num_concurrent=32\" \\\n    --batch_size 16 \\\n    --output_path logs \n```\n\n### 核心功能\n\n- **统一安装**：所有基准测试一步到位，消除依赖冲突\n- **并行评估**：\n  - 数据并行：在多张 GPU 上分散评估任务，加快结果生成速度\n  - 模型并行：处理单个 GPU 无法容纳的大型模型\n- **简化使用**：通过一致的命令行界面运行任何基准测试\n- **结果管理**：\n  - 本地结果跟踪，输出格式标准化\n  - 可选数据库集成，实现系统化跟踪\n  - 排行榜提交功能（需设置数据库）\n\n## ⚡ 快速入门\n\n### 安装\n\n建议使用 
conda（[安装说明](https:\u002F\u002Fdocs.anaconda.com\u002Fminiconda\u002Finstall\u002F#quick-command-line-install)）。\n\n```bash\n# 创建并激活 conda 环境\nconda create --name evalchemy python=3.10\nconda activate evalchemy\n\n# 克隆仓库\ngit clone git@github.com:mlfoundations\u002Fevalchemy.git   \ncd evalchemy\n\n# 安装依赖\npip install -e .\npip install -e eval\u002Fchat_benchmarks\u002Falpaca_eval\n\n# 注意：在某些 HPC 系统上，您可能需要修改 pyproject.toml，\n# 将 fschat 依赖项的路径改为绝对路径：\n# 将：\"fschat @ file:eval\u002Fchat_benchmarks\u002FMTBench\"\n# 替换为：\"fschat @ file:\u002F\u002F\u002Fabsolute\u002Fpath\u002Fto\u002Fevalchemy\u002Feval\u002Fchat_benchmarks\u002FMTBench\"\n# 或者直接移除，并单独运行\n# pip install -e eval\u002Fchat_benchmarks\u002FMTBench \n\n# 登录 HuggingFace 以获取数据集和模型。\nhuggingface-cli login\n```\n\n## 📚 可用任务\n\n### 内置基准测试\n- 来自 [LM Evaluation Harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) 的所有任务\n- 自定义基于指令的任务（位于 [`eval\u002Fchat_benchmarks\u002F`](eval\u002Fchat_benchmarks\u002F)）：\n  - **MTBench**: [多轮对话评估基准](https:\u002F\u002Fgithub.com\u002Fmtbench101\u002Fmt-bench-101)\n  - **WildBench**: [真实世界任务评估](https:\u002F\u002Fgithub.com\u002Fallenai\u002FWildBench)\n  - **RepoBench**: [代码理解和仓库级任务](https:\u002F\u002Fgithub.com\u002FLeolty\u002Frepobench)\n  - **MixEval**: [跨领域的综合评估](https:\u002F\u002Fgithub.com\u002FPsycoy\u002FMixEval)\n  - **IFEval**: [指令遵循能力评估](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002Ftree\u002Fmaster\u002Finstruction_following_eval)\n  - **AlpacaEval**: [指令遵循评估](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval)\n  - **HumanEval**: [代码生成与问题求解](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fhuman-eval)\n  - **HumanEvalPlus**: [包含更多测试用例的 HumanEval](https:\u002F\u002Fgithub.com\u002Fevalplus\u002Fevalplus)\n  - **ZeroEval**: [逻辑推理与问题求解](https:\u002F\u002Fgithub.com\u002FWildEval\u002FZeroEval)\n  - **MBPP**: [Python 
编程基准](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002Ftree\u002Fmaster\u002Fmbpp)\n  - **MBPPPlus**: [包含更多测试用例的 MBPP](https:\u002F\u002Fgithub.com\u002Fevalplus\u002Fevalplus)\n  - **BigCodeBench**: [针对多样化函数调用和复杂指令的代码生成基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.15877)\n\n    > **🚨 警告：** 对于 BigCodeBench 的评估，我们强烈建议使用 Docker 容器，因为在主机上执行 LLM 生成的代码可能会导致破坏性后果。更多信息请参见 [这里](eval\u002Fchat_benchmarks\u002FBigCodeBench\u002FREADME.md)。\n  - **MultiPL-E**: [大型语言模型在多编程语言代码方面的评估](https:\u002F\u002Fgithub.com\u002Fnuprl\u002FMultiPL-E\u002F)\n  - **CRUXEval**: [代码推理、理解和执行评估](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.03065)\n  - **AIME24**: [数学推理数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdi-zhang-fdu\u002FAIME_1983_2024)\n  - **AIME25**: [数学推理数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FTIGER-Lab\u002FAIME25)\n  - **AMC23**: [数学推理数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAI-MO\u002Faimo-validation-amc)\n  - **MATH500**: [数学推理数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceH4\u002FMATH-500)，源自 [Let's Verify Step by Step](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fprm800k\u002Ftree\u002Fmain?tab=readme-ov-file#math-splits)\n  - **LiveCodeBench**: [LLMs 的代码基准测试](https:\u002F\u002Flivecodebench.github.io\u002F)\n  - **LiveBench**: [一个专为避免测试集污染并实现客观评估而设计的 LLM 基准](https:\u002F\u002Flivebench.ai\u002F#\u002F)\n  - **GPQA Diamond**: [研究生级别的防谷歌问答基准](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FIdavidrein\u002Fgpqa)\n  - **Alice in Wonderland**: [展示 LLM 完全推理失效的简单任务](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.02061)\n  - **Arena-Hard-Auto**（即将推出）：[面向指令微调 LLM 的自动评估工具](https:\u002F\u002Fgithub.com\u002Flmarena\u002Farena-hard-auto)\n  - **SWE-Bench**（即将推出）：[评估大型语言模型处理现实软件问题的能力](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSWE-bench)\n  - **SafetyBench**（即将推出）：[评估 LLM 的安全性](https:\u002F\u002Fgithub.com\u002Fthu-coai\u002FSafetyBench)\n  - **SciCode 
Bench**（即将推出）：[评估语言模型生成用于解决实际科学研究问题的代码的能力](https:\u002F\u002Fgithub.com\u002Fscicode-bench\u002FSciCode)\n  - **Berkeley Function Calling Leaderboard**（即将推出）：[评估 LLM 使用 API 的能力](https:\u002F\u002Fgorilla.cs.berkeley.edu\u002Fblogs\u002F13_bfcl_v3_multi_turn.html)\n\n我们已在 [`reproduced_benchmarks.md`](reproduced_benchmarks.md) 中记录了这些基准测试的复现结果，并与已发表的数据进行了对比。\n\n\n### 基本用法\n\n如果需要使用 LLM 作为评判者，请确保在运行评估之前已在环境中设置好 `OPENAI_API_KEY`。\n\n```bash\npython -m eval.eval \\\n    --model hf \\\n    --tasks HumanEval,mmlu \\\n    --model_args \"pretrained=mistralai\u002FMistral-7B-Instruct-v0.3\" \\\n    --batch_size 2 \\\n    --output_path logs\n```\n\n结果将被写入 `output_path` 目录中。如果您已安装 `jq`（[下载地址](https:\u002F\u002Fjqlang.github.io\u002Fjq\u002Fdownload\u002F)），可以在评估完成后轻松查看结果。例如：`jq '.results' logs\u002FQwen__Qwen2.5-7B-Instruct\u002Fresults_2024-11-17T17-12-28.668908.json`\n\n**参数说明**：\n\n- `--model`: 指定要评估的模型类型或提供商（例如：hf）\n- `--tasks`: 以逗号分隔的任务列表，表示要评估的基准测试\n- `--model_args`: 模型路径及参数。以逗号分隔的参数列表，传递给模型构造函数。格式为 `\"arg1=val1,arg2=val2,...\"`。支持的参数列表可在 [此处](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002F365fcda9b85bbb6e0572d91976b8daf409164500\u002Flm_eval\u002Fmodels\u002Fhuggingface.py#L66) 查看。\n- `--batch_size`: 推理时的批处理大小\n- `--output_path`: 保存评估结果的目录\n\n示例：运行多个基准测试：\n```bash\npython -m eval.eval \\\n    --model hf \\\n    --tasks MTBench,WildBench,alpaca_eval \\\n    --model_args \"pretrained=mistralai\u002FMistral-7B-Instruct-v0.3\" \\\n    --batch_size 2 \\\n    --output_path logs\n```\n\n**配置快捷方式**：\n\n为了能够重复使用常用设置，而无需每次都手动提供完整参数，我们支持从 YAML 文件中读取评估配置。这些配置文件可以替代 `--batch_size`、`--tasks` 和 `--annotator_model` 参数。一些示例配置文件位于 `.\u002Fconfigs` 目录中。要使用这些配置文件，可以使用 `--config` 标志，如下所示：\n\n```bash\npython -m eval.eval \\\n    --model hf \\\n    --model_args \"pretrained=mistralai\u002FMistral-7B-Instruct-v0.3\" \\\n    --output_path logs \\\n    --config configs\u002Flight_gpt4omini0718.yaml\n```\n\n我们在 
[`eval\u002Fexamples`](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002FEvalchemy\u002Ftree\u002Fmain\u002Feval\u002Fexamples) 中添加了更多命令示例，帮助您快速上手 Evalchemy。\n\n## 🔧 高级用法\n\n### 对不同模型的支持\n\n通过 LM-Eval-Harness，我们支持所有 HuggingFace 模型，并且目前正在增加对 OpenAI 和 VLLM 等其他 LM-Eval-Harness 支持的模型的支持。有关这些模型的更多信息，请访问 [模型页面](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Ftree\u002Fmain\u002Flm_eval\u002Fmodels)。\n\n要选择模型，只需设置 `pretrained=\u003CHuggingFace 模型名称>`，其中模型名称可以是 HuggingFace 官方模型名，也可以是本地模型的路径。\n\n### HPC 分布式评估\n\n为了获得更快的评估速度，可以使用完全的数据并行，并为每块 GPU 启动一个 vLLM 进程。\n\n我们还使这一过程在 HPC（高性能计算）集群上的多节点环境中大规模执行变得非常简单：\n\n```bash\npython eval\u002Fdistributed\u002Flaunch.py --model_name \u003Cmodel_id> --tasks \u003Ctask_list> --num_shards \u003Cn> --watchdog\n```\n\n关键特性：\n- 在多个计算节点上并行运行评估\n- 极大地缩短大型基准测试的墙钟时间\n- 支持离线模式，适用于 GPU 节点上无互联网连接的环境\n- 自动集群检测与配置\n- 高效的结果收集与评分\n\n更多详细信息请参阅 [分布式 README](eval\u002Fdistributed\u002FREADME.md)。\n\n注意：此配置针对特定的 HPC 集群，但可以轻松调整。此外，也可以通过使用 `CUDA_VISIBLE_DEVICES` 代替 SLURM 作业数组来适应非 HPC 环境。\n\n### 多 GPU 评估\n\n注意：这比完全数据并行评估（见上一节）要慢。\n\n```bash\naccelerate launch --num-processes \u003Cnum-gpus> --num-machines \u003Cnum-nodes> \\\n    --multi-gpu -m eval.eval \\\n    --model hf \\\n    --tasks MTBench,alpaca_eval \\\n    --model_args 'pretrained=mistralai\u002FMistral-7B-Instruct-v0.3' \\\n    --batch_size 2 \\\n    --output_path logs\n```\n\n### 大模型评估\n\n对于无法容纳在单个 GPU 上的模型，请使用模型并行：\n\n```bash\npython -m eval.eval \\\n    --model hf \\\n    --tasks MTBench,alpaca_eval \\\n    --model_args 'pretrained=mistralai\u002FMistral-7B-Instruct-v0.3,parallelize=True' \\\n    --batch_size 2 \\\n    --output_path logs\n```\n\n> **💡 注意**：虽然支持“自动”批大小，但我们建议手动调整批大小以获得最佳性能。最优批大小取决于模型大小、GPU 内存以及具体基准测试。我们在 8 块 H100 GPU 上评估 Llama-3-8B-Instruct 时，最大批大小设为 32，最小批大小设为 4（用于 RepoBench）。\n\n### 输出日志结构\n\n我们生成的日志包含了每次评估的关键信息，有助于指导您的实验。以下是我们日志中突出显示的重要内容：\n\n- 模型配置\n  - `model`: 使用的模型框架\n  - `model_args`: 模型框架的参数\n  - `batch_size`: 处理批次大小\n  - `device`: 计算设备规格\n 
 - `annotator_model`: 用于标注的模型（例如 “gpt-4o-mini-2024-07-18”）\n- 随机种子配置\n  - `random_seed`: 通用随机种子\n  - `numpy_seed`: NumPy 特定的种子\n  - `torch_seed`: PyTorch 特定的种子\n  - `fewshot_seed`: 少样本示例的种子\n- 模型详情\n  - `model_num_parameters`: 模型参数数量\n  - `model_dtype`: 模型数据类型\n  - `model_revision`: 模型版本\n  - `model_sha`: 模型提交哈希值\n- 版本控制\n  - `git_hash`: 仓库提交哈希值\n  - `date`: 评估的 Unix 时间戳\n  - `transformers_version`: Hugging Face Transformers 版本\n- 分词器配置\n  - `tokenizer_pad_token`: 填充标记详情\n  - `tokenizer_eos_token`: 句子结束标记\n  - `tokenizer_bos_token`: 句子开始标记\n  - `eot_token_id`: 文本结束标记 ID\n  - `max_length`: 最大序列长度\n- 模型设置\n  - `model_source`: 模型来源平台\n  - `model_name`: 完整的模型标识符\n  - `model_name_sanitized`: 用于文件系统的清理后模型名称\n  - `chat_template`: 对话模板\n  - `chat_template_sha`: 模板哈希值\n- 计时信息\n  - `start_time`: 评估开始时间戳\n  - `end_time`: 评估结束时间戳\n  - `total_evaluation_time_seconds`: 总耗时\n- 硬件环境\n  - PyTorch 版本及构建配置\n  - 操作系统详情\n  - GPU 配置\n  - CPU 规格\n  - CUDA 和驱动程序版本\n  - 相关库版本\n\n### 自定义评估\n\n#### 🤖 更改标注模型\n\n作为 Evalchemy 的一部分，我们希望在标准基准测试中轻松更换不同的语言模型评判者。目前，我们支持两种评判者设置。第一种是默认设置，即使用基准测试的默认评判者。要激活此设置，您可以什么都不做，或者传递以下参数：\n\n```bash\n--annotator_model auto\n```\n\n除了默认设置外，我们还支持使用 gpt-4o-mini-2024-07-18 作为评判者：\n\n```bash\n--annotator_model gpt-4o-mini-2024-07-18\n```\n\n我们计划在未来添加对不同评判者的支持！\n\n### ⏱️ 运行时间和成本分析\n\nEvalchemy 使得运行常见基准测试变得简单、快速且灵活！我们列出了使用 Evalchemy 在 8 块 H100 GPU 上对 Meta-Llama-3-8B-Instruct 进行评估时，各项基准测试的速度和成本。\n\n| 基准测试 | 运行时间（8xH100） | 批次大小 | 总令牌数 | 默认评判者成本 ($) | gpt-4o-mini 评判者成本 ($) | 备注 |\n|-----------|------------------|------------|--------------|----------------|-------------------|--------|\n| MTBench | 14:00 | 32 | ~196K | 6.40 | 0.05 | |\n| WildBench | 38:00 | 32 | ~2.2M | 30.00 | 0.43 | |\n| RepoBench | 46:00 | 4 | ~23K | - | - | 由于内存限制，批次较小 |\n| MixEval | 13:00 | 32 | ~4-6M | 3.36 | 0.76 | 根据评判者模型而异 |\n| AlpacaEval | 16:00 | 32 | ~936K | 9.40 | 0.14 | |\n| HumanEval | 4:00 | 32 | ~300 | - | - | 无 API 费用 |\n| IFEval | 1:30 | 32 | ~550 | - | - | 无 API 费用 |\n| ZeroEval | 
1:44:00 | 32 | ~8K | - | - | 运行时间最长 |\n| MBPP | 6:00 | 32 | 500 | - | - | 无 API 费用 |\n| MMLU | 7:00 | 32 | 500 | - | - | 无 API 费用 |\n| ARC | 4:00 | 32 | - | - | - | 无 API 费用 |\n| DROP | 20:00 | 32 | - | - | - | 无 API 费用 |\n\n**备注：**\n- 运行时间是在 8 块 H100 GPU 上使用 Meta-Llama-3-8B-Instruct 模型测得的。\n- 批次大小已针对内存和速度进行了优化。\n- API 费用因选择的评判者模型而异。\n\n**省费提示：**\n- 尽可能使用 gpt-4o-mini-2024-07-18 评判者以显著节省成本。\n- 根据可用内存调整批次大小。\n- 考虑使用数据并行评估以获得更快的结果。\n\n### 🔐 特殊访问要求\n\n#### ZeroEval 访问\n要运行 ZeroEval 基准测试，您需要：\n\n1. 在 Hugging Face 上申请访问 [ZebraLogicBench-private 数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002FZebraLogicBench-private)\n2. 接受条款和条件\n3. 在运行评估时登录您的 Hugging Face 账户\n\n## 🛠️ 实现自定义评估\n\n要添加新的评估系统：\n\n1. 在 `eval\u002Fchat_benchmarks\u002F` 下创建一个新的目录。\n2. 实现 `eval_instruct.py`，其中包含两个必需的函数：\n   - `eval_instruct(model)`: 接受 LM 评估模型，返回结果字典。\n   - `evaluate(results)`: 接受结果字典，返回评估指标。\n\n### 添加外部评估代码库\n\n使用 git subtree 来管理外部评估代码：\n\n```bash\n# 添加外部代码库\ngit subtree add --prefix=eval\u002Fchat_benchmarks\u002Fnew_eval https:\u002F\u002Fgithub.com\u002Foriginal\u002Frepo.git main --squash\n\n# 拉取更新\ngit subtree pull --prefix=eval\u002Fchat_benchmarks\u002Fnew_eval https:\u002F\u002Fgithub.com\u002Foriginal\u002Frepo.git main --squash\n\n# 将贡献推回\ngit subtree push --prefix=eval\u002Fchat_benchmarks\u002Fnew_eval https:\u002F\u002Fgithub.com\u002Foriginal\u002Frepo.git contribution-branch\n```\n\n### 🔍 调试模式\n\n要在调试模式下运行评估，请添加 `--debug` 标志：\n\n```bash\npython -m eval.eval \\\n    --model hf \\\n    --tasks MTBench \\\n    --model_args \"pretrained=mistralai\u002FMistral-7B-Instruct-v0.3\" \\\n    --batch_size 2 \\\n    --output_path logs \\\n    --debug\n```\n\n这在测试新的评估实现、调试模型配置、验证数据集访问权限以及测试数据库连接时特别有用。\n\n### 🚀 性能提示\n\n1. 
利用批处理以加快评估速度：\n```python\nall_instances.append(\n    Instance(\n        \"generate_until\",\n        example,\n        (\n            inputs,\n            {\n                \"max_new_tokens\": 1024,\n                \"do_sample\": False,\n            },\n        ),\n        idx,\n    )\n)\n\noutputs = self.compute(model, all_instances)\n```\n\n2. 使用 LM-eval 日志记录器，以确保跨评估的日志一致性。\n\n### 🔧 故障排除\nEvalchemy 已在 CUDA 12.4 上进行了测试。如果您遇到类似以下问题：`undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12`，请尝试更新您的 CUDA 版本：\n```bash\nwget https:\u002F\u002Fdeveloper.download.nvidia.com\u002Fcompute\u002Fcuda\u002Frepos\u002Fdebian11\u002Fx86_64\u002Fcuda-keyring_1.1-1_all.deb\nsudo dpkg -i cuda-keyring_1.1-1_all.deb\nsudo add-apt-repository contrib\nsudo apt-get update\nsudo apt-get -y install cuda-toolkit-12-4\n```\n\n## 🏆 排行榜集成\n为了跟踪实验和评估结果，我们支持将结果记录到 PostgreSQL 数据库中。有关条目模式和数据库设置的详细信息，请参阅 [`database\u002F`](database\u002F)。\n\n## 贡献\n感谢所有为使该项目成为可能而做出贡献的开发者！\n请按照 [这些说明](CONTRIBUTING.md) 来了解如何参与贡献。\n\n## 引用\n如果您觉得 Evalchemy 有用，请考虑引用我们！\n\n```\n@software{Evalchemy,\n  author = {Raoof, Negin and Guha, Etash Kumar and Marten, Ryan and Mercat, Jean and Frankel, Eric and Keh, Sedrick and Bansal, Hritik and Smyrnis, Georgios and Nezhurina, Marianna and Vu, Trung and Sprague, Zayne Rea and Merrill, Mike A and Chen, Liangyu and Choi, Caroline and Khan, Zaid and Grover, Sachin and Feuer, Benjamin and Suvarna, Ashima and Su, Shiye and Zhao, Wanjia and Sharma, Kartik and Ji, Charlie Cheng-Jie and Arora, Kushal and Li, Jeffrey and Gokaslan, Aaron and Pratt, Sarah M and Muennighoff, Niklas and Saad-Falcon, Jon and Yang, John and Aali, Asad and Pimpalgaonkar, Shreyas and Albalak, Alon and Dave, Achal and Pouransari, Hadi and Durrett, Greg and Oh, Sewoong and Hashimoto, Tatsunori and Shankar, Vaishaal and Choi, Yejin and Bansal, Mohit and Hegde, Chinmay and Heckel, Reinhard and Jitsev, Jenia and Sathiamoorthy, Maheswaran and Dimakis, Alex and Schmidt, Ludwig},\n  month = {June},\n  title = {{Evalchemy}},\n  year = {2025}\n}\n```","# Evalchemy 
快速上手指南\n\nEvalchemy 是一个统一且易用的工具包，专为评估后训练语言模型（Post-trained LLMs）而设计。它由 DataComp 社区和 Bespoke Labs 开发，基于 LM-Eval-Harness 构建，支持多种基准测试、并行评估及结果管理。\n\n## 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **操作系统**: Linux (推荐) 或 macOS。\n*   **Python 版本**: 3.10 (强烈建议使用 Conda 管理环境)。\n*   **依赖工具**:\n    *   [Conda](https:\u002F\u002Fdocs.anaconda.com\u002Fminiconda\u002Finstall\u002F#quick-command-line-install) (用于环境隔离)\n    *   Git (用于克隆代码库)\n    *   Hugging Face CLI (用于下载模型和数据集)\n*   **硬件**: 至少一张 NVIDIA GPU (用于本地模型推理)。若评估需要 LLM 作为裁判（如 AlpacaEval），需配置 `OPENAI_API_KEY`。\n\n> **国内开发者提示**：若访问 Hugging Face 或 GitHub 较慢，建议配置镜像源或使用代理加速。\n> *   pip 镜像：`pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple ...`\n> *   Hugging Face 镜像：设置环境变量 `export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com`\n\n## 安装步骤\n\n推荐使用 Conda 创建独立环境以避免依赖冲突。\n\n1.  **创建并激活 Conda 环境**\n    ```bash\n    conda create --name evalchemy python=3.10\n    conda activate evalchemy\n    ```\n\n2.  **克隆仓库**\n    ```bash\n    git clone git@github.com:mlfoundations\u002Fevalchemy.git   \n    cd evalchemy\n    ```\n\n3.  **安装核心依赖**\n    ```bash\n    # 安装主包\n    pip install -e .\n    \n    # 安装聊天基准测试依赖 (如 AlpacaEval)\n    pip install -e eval\u002Fchat_benchmarks\u002Falpaca_eval\n    ```\n    > **注意**：在某些高性能计算 (HPC) 系统中，若遇到 `fschat` 路径问题，可能需要手动修改 `pyproject.toml` 使用绝对路径，或单独运行 `pip install -e eval\u002Fchat_benchmarks\u002FMTBench`。\n\n4.  **登录 Hugging Face**\n    您需要登录以下载受保护的模型和数据集：\n    ```bash\n    huggingface-cli login\n    ```\n    *(按提示输入您的 Access Token)*\n\n5.  
**(可选) 配置 API Key**\n    如果运行的基准测试需要调用外部 API 进行评判（例如使用 GPT-4 作为裁判）：\n    ```bash\n    export OPENAI_API_KEY=\"your-api-key-here\"\n    ```\n\n## 基本使用\n\n安装完成后，您可以使用统一的命令行接口运行评估。以下是最基础的用法示例。\n\n### 运行单个基准测试\n\n以下命令使用 Hugging Face 模型 (`mistralai\u002FMistral-7B-Instruct-v0.3`) 在 `HumanEval` (代码生成) 和 `mmlu` (知识问答) 基准上进行评估：\n\n```bash\npython -m eval.eval \\\n    --model hf \\\n    --tasks HumanEval,mmlu \\\n    --model_args \"pretrained=mistralai\u002FMistral-7B-Instruct-v0.3\" \\\n    --batch_size 2 \\\n    --output_path logs\n```\n\n**参数说明：**\n*   `--model`: 模型类型，`hf` 代表 Hugging Face 模型。\n*   `--tasks`: 要评估的任务列表，用逗号分隔。\n*   `--model_args`: 模型路径及参数。`pretrained` 可以是 Hugging Face 模型 ID 或本地路径。\n*   `--batch_size`: 推理批大小，根据显存大小调整。\n*   `--output_path`: 结果保存目录。\n\n### 查看结果\n\n评估完成后，结果将以 JSON 格式保存在 `output_path` 指定的目录中。如果您安装了 `jq` 工具，可以快速查看结果摘要：\n\n```bash\n# 示例：查看最新生成的结果文件中的 results 字段\njq '.results' logs\u002FQwen__Qwen2.5-7B-Instruct\u002Fresults_*.json\n```\n\n### 使用预设配置文件 (简化操作)\n\n为了避免每次输入冗长的参数，Evalchemy 支持通过 YAML 文件加载常用配置（位于 `.\u002Fconfigs` 目录）：\n\n```bash\npython -m eval.eval \\\n    --model hf \\\n    --model_args \"pretrained=mistralai\u002FMistral-7B-Instruct-v0.3\" \\\n    --output_path logs \\\n    --config configs\u002Flight_gpt4omini0718.yaml\n```\n\n### 支持的其他模型类型\n\n除了本地 Hugging Face 模型，Evalchemy 还支持通过参数切换其他后端：\n\n*   **vLLM (高性能推理)**:\n    ```bash\n    python -m eval.eval \\\n        --model vllm \\\n        --tasks alpaca_eval \\\n        --model_args \"pretrained=meta-llama\u002FMeta-Llama-3-8B-Instruct\" \\\n        --batch_size 16 \\\n        --output_path logs\n    ```\n*   **OpenAI API 模型**:\n    ```bash\n    python -m eval.eval \\\n        --model openai-chat-completions \\\n        --tasks alpaca_eval \\\n        --model_args \"model=gpt-4o-mini-2024-07-18,num_concurrent=32\" \\\n        --batch_size 16 \\\n        --output_path logs \n    ```\n*   **Curator (支持更多 API 模型，如 Gemini)**:\n    ```bash\n    python -m eval.eval \\\n          --model 
curator  \\\n          --tasks AIME24,MATH500,GPQADiamond \\\n          --model_name \"gemini\u002Fgemini-2.0-flash-thinking-exp-01-21\" \\\n          --apply_chat_template False \\\n          --model_args 'tokenized_requests=False' \\\n          --output_path logs\n    ```\n\n更多高级用法（如多卡并行、分布式集群评估）请参考项目官方文档。","某 AI 初创团队正在为医疗咨询场景微调一款大语言模型，需要在发布前快速验证其在专业问答（MedQA）、逻辑推理（MATH500）及代码生成（HumanEvalPlus）等多个维度的综合表现。\n\n### 没有 evalchemy 时\n- **环境配置噩梦**：不同评测基准依赖冲突严重，团队需花费数天手动解决 Python 包兼容性问题，导致评估工作迟迟无法启动。\n- **多卡资源浪费**：缺乏原生的数据并行支持，面对大规模测试集只能单卡串行跑分，耗时从几小时拉长至数天，严重拖慢迭代节奏。\n- **结果管理混乱**：各基准输出格式不统一，人工整理 CSV 和日志极易出错，难以横向对比模型在不同任务上的优劣。\n- **新模型接入困难**：想测试最新的 API 模型或 vLLM 加速模型时，需反复修改底层代码适配接口，开发成本极高。\n\n### 使用 evalchemy 后\n- **一键统一环境**：通过统一的安装流程自动解决所有基准的依赖冲突，团队可在几分钟内完成环境搭建并立即开始测试。\n- **高效并行评估**：利用内置的数据并行功能，轻松调动多张 GPU 同时处理任务，将原本数天的评估时间压缩至几十分钟。\n- **标准化结果追踪**：自动生成标准格式的本地报告并支持数据库集成，清晰呈现模型在医疗、数学及代码任务上的得分雷达图。\n- **灵活模型支持**：仅需一条命令即可切换评估本地 vLLM 部署模型或云端 API 模型（如 Gemini、GPT-4o），无需任何代码改动。\n\nevalchemy 将繁琐的模型评估工程转化为标准化的自动化流程，让团队能专注于模型优化而非基础设施维护。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmlfoundations_evalchemy_a72fb652.png","mlfoundations","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmlfoundations_a182ca1f.png","",null,"https:\u002F\u002Fpeople.csail.mit.edu\u002Fludwigs\u002F","https:\u002F\u002Fgithub.com\u002Fmlfoundations",[82,86,90,94,98,102,106,110],{"name":83,"color":84,"percentage":85},"HTML","#e34c26",85.7,{"name":87,"color":88,"percentage":89},"Python","#3572A5",10.1,{"name":91,"color":92,"percentage":93},"JavaScript","#f1e05a",1.9,{"name":95,"color":96,"percentage":97},"CSS","#663399",1.6,{"name":99,"color":100,"percentage":101},"Shell","#89e051",0.5,{"name":103,"color":104,"percentage":105},"Jupyter Notebook","#DA5B0B",0.3,{"name":107,"color":108,"percentage":109},"Dockerfile","#384d54",0,{"name":111,"color":112,"percentage":109},"Makefile","#427819",585,79,"2026-03-23T22:06:53","Linux, macOS","评估本地模型时必需 NVIDIA GPU（支持多卡数据并行或模型并行）；支持 vLLM 加速引擎；评估 API 模型时无需本地 
GPU。具体显存需求取决于模型大小，未明确最低要求。","未说明（建议根据模型大小配置，大模型并行评估需较大内存）",{"notes":120,"python":121,"dependencies":122},"1. 强烈建议使用 Conda 创建 Python 3.10 环境进行安装。\n2. 运行前需执行 'huggingface-cli login' 登录以下载数据集和模型。\n3. 若使用 LLM 作为裁判（如 AlpacaEval, MTBench），需设置 OPENAI_API_KEY 环境变量。\n4. 在部分 HPC 系统上安装时，可能需要修改 pyproject.toml 中的 fschat 依赖路径为绝对路径。\n5. 运行 BigCodeBench 基准测试时，强烈建议在 Docker 容器中进行，以防生成的代码破坏主机环境。\n6. 支持通过 Curator 调用各类 API 模型（包括 LiteLLM 支持的提供商）。","3.10",[123,124,125,126,127,128,129,130,131],"lm-evaluation-harness","vllm","accelerate","curator","litellm","fschat","huggingface_hub","torch","transformers",[26,13,54],"2026-03-27T02:49:30.150509","2026-04-06T08:09:01.938760",[136,141,146,151,156,161,166],{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},10276,"使用 pip 安装 evalchemy 时遇到依赖冲突（如 liger-kernel\u002Ftriton）或 Python 版本不兼容错误怎么办？","首先，项目已移除了 liger-kernel 依赖，请尝试重新安装。如果仍遇到 `ray[default]` 找不到版本的错误，通常是因为 Python 版本过高（如 3.12）。建议按照以下步骤操作：\n1. 使用 pyenv 将 Python 降级到 3.10 版本。\n2. 创建虚拟环境并激活：\n   python3 -m venv .venv --prompt evalchemy\n   source .venv\u002Fbin\u002Factivate\n3. 升级 pip 并安装：\n   pip install --upgrade pip\n   pip install -e \".[eval]\"\n   pip install -e eval\u002Fchat_benchmarks\u002Falpaca_eval","https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fevalchemy\u002Fissues\u002F39",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},10277,"运行 AIME25 任务时报错\"Tasks {'AIME25'} are not recognized\"如何解决？","这通常是由于本地代码未更新或 lm_eval 版本不匹配导致的。请按以下步骤排查：\n1. 确认源码中是否存在文件 `eval\u002Fchat_benchmarks\u002FAIME25\u002Feval_instruct.py`。如果不存在，请拉取最新的 evalchemy 代码。\n2. 
如果文件存在，建议卸载 lm_eval 并重新安装 Evalchemy，以解决版本冲突问题。","https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fevalchemy\u002Fissues\u002F83",{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},10278,"在使用 OpenAI Chat Completions 模型评估时，为什么设置的 batch_size 不生效（始终为 1）？","这是因为参数传递位置错误。`batch_size` 是评估框架的参数，而不是模型本身的参数。不要将其放在 `--model_args` 中，而应作为独立的命令行参数传递。正确用法示例：\npython -m eval.eval \\\n    --model hf \\\n    --tasks HumanEval,mmlu \\\n    --model_args \"pretrained=mistralai\u002FMistral-7B-Instruct-v0.3\" \\\n    --batch_size 2 \\\n    --output_path logs","https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fevalchemy\u002Fissues\u002F121",{"id":152,"question_zh":153,"answer_zh":154,"source_url":155},10279,"LiveCodeBench 在迭代执行时发生崩溃或数据类型不一致错误怎么办？","这是由于数据集加载后未转换为列表格式或缺少类型验证导致的。解决方案是在代码中进行显式列表转换和类型检查：\n1. 将数据集转换为列表：examples = list(examples_dataset)\n2. 在遍历前添加类型验证：\n   for idx, example in enumerate(examples):\n       if not isinstance(example, dict):\n           self.logger.error(f\"Example {idx} is not dict: {type(example)}\")\n           continue\n       # 处理 example...","https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fevalchemy\u002Fissues\u002F136",{"id":157,"question_zh":158,"answer_zh":159,"source_url":160},10280,"为什么在使用 --debug 标志运行 AIME24、AIME25 等基准测试时会失败？","这些基准测试尚未实现调试模式的切片逻辑（即限制样本数量）。与其他已支持该功能的基准测试（如 MTBench）类似，需要在代码中添加如下逻辑来支持 debug 模式：\nif self.debug:\n    examples = examples[:5]  # 限制为前 5 个样本以便快速测试\n这将允许在调试模式下仅运行少量样本，从而加快开发迭代速度并降低计算成本。","https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fevalchemy\u002Fissues\u002F134",{"id":162,"question_zh":163,"answer_zh":164,"source_url":165},10281,"WildBench 评估速度非常慢，是否有办法加速或缓存结果？","针对 OpenAI 调用耗时过长的问题，社区推荐使用 curator 工具来优化性能和处理缓存。你可以参考该项目：https:\u002F\u002Fgithub.com\u002Fbespokelabsai\u002Fcurator，它专门用于此类场景的加速和结果管理。","https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fevalchemy\u002Fissues\u002F25",{"id":167,"question_zh":168,"answer_zh":169,"source_url":170},10282,"如何复现 DeepSeek-R1 论文中提到的 Codeforces 
基准测试结果？","目前可以通过 Hugging Face 上的 open-r1 数据集进行测试分割评估。具体数据集地址为：https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopen-r1\u002Fcodeforces\u002Fviewer\u002Fdefault\u002Ftest。如果需要更详细的评估标准，需进一步确认是否存在统一的 Codeforces 基准评估方式。","https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fevalchemy\u002Fissues\u002F101",[]]