[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-groq--openbench":3,"tool-groq--openbench":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",147882,2,"2026-04-09T11:32:47",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108111,"2026-04-08T11:23:26",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 
助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":77,"owner_email":78,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":32,"env_os":98,"env_gpu":99,"env_ram":100,"env_deps":101,"category_tags":107,"github_topics":109,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":111,"updated_at":112,"faqs":113,"releases":144},5981,"groq\u002Fopenbench","openbench","Provider-agnostic, open-source evaluation infrastructure for language models","openbench 是一个开源、中立的大语言模型评估基础设施，旨在为开发者及研究人员提供标准化且可复现的基准测试方案。它解决了当前大模型评估中存在的痛点：不同厂商接口各异导致测试困难、私有数据难以在安全环境下验证，以及缺乏统一的对比标准。\n\n无论是需要快速验证模型性能的工程师，还是追求严谨实验数据的研究人员，亦或是关注数据隐私的企业用户，都能通过 openbench 高效开展工作。其核心亮点在于“提供商无关性”，原生支持包括 Groq、OpenAI、Anthropic、Google 及本地 Ollama 等 30 多种模型来源，用户只需配置相应密钥即可无缝切换。工具内置了 MMLU、HumanEval 等 95+ 
个权威评测集，覆盖数学推理、代码生成、科学常识等多个维度，并允许用户轻松添加自定义本地评测以保障数据隐私。\n\n基于行业标准的 inspect-ai 框架构建，openbench 提供了极简的命令行体验，用户可在分钟内完成从环境搭建到结果可视化的全流程，甚至能将评估结果直接推送至 Hugging Face。它让大模型的性能对比变得像运行一条简单命令般轻松，是探索和优化模型能力的","openbench 是一个开源、中立的大语言模型评估基础设施，旨在为开发者及研究人员提供标准化且可复现的基准测试方案。它解决了当前大模型评估中存在的痛点：不同厂商接口各异导致测试困难、私有数据难以在安全环境下验证，以及缺乏统一的对比标准。\n\n无论是需要快速验证模型性能的工程师，还是追求严谨实验数据的研究人员，亦或是关注数据隐私的企业用户，都能通过 openbench 高效开展工作。其核心亮点在于“提供商无关性”，原生支持包括 Groq、OpenAI、Anthropic、Google 及本地 Ollama 等 30 多种模型来源，用户只需配置相应密钥即可无缝切换。工具内置了 MMLU、HumanEval 等 95+ 个权威评测集，覆盖数学推理、代码生成、科学常识等多个维度，并允许用户轻松添加自定义本地评测以保障数据隐私。\n\n基于行业标准的 inspect-ai 框架构建，openbench 提供了极简的命令行体验，用户可在分钟内完成从环境搭建到结果可视化的全流程，甚至能将评估结果直接推送至 Hugging Face。它让大模型的性能对比变得像运行一条简单命令般轻松，是探索和优化模型能力的得力助手。","# openbench\n\n\u003Ch2 align=\"center\">\n \u003Cbr>\n \u003Cimg src=\"docs\u002Flogo\u002Fdark.svg\" alt=\"openbench\" width=\"250\">\n \u003Cbr>\n \u003Cbr>\nProvider-agnostic, open-source evaluation infrastructure for language models\n \u003Cbr>\n\u003C\u002Fh2>\n\n\u003Cp align=\"center\">\n \u003Ca href=\"https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fopenbench\">\u003Cimg src=\"https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fopenbench.svg\">\n \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fblob\u002Fmain\u002FLICENSE.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg\">\n \u003Ca href=\"https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10+-blue.svg\">\n \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fstargazers\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgroq\u002Fopenbench\">\u003C\u002Fa>\n \u003C\u002Fa>\n\u003C\u002Fp>\n\n\nopenbench provides standardized, reproducible benchmarking for LLMs across 30+ evaluation suites (and growing) spanning knowledge, math, reasoning, coding, science, reading comprehension, health, long-context recall, graph reasoning, and 
first-class support for your own local evals to preserve privacy. **Works with any model provider** - Groq, OpenAI, Anthropic, Cohere, Google, AWS Bedrock, Azure, local models via Ollama, Hugging Face, and 30+ other providers.\n\nTo get started, see the tutorial below or reference the [docs](https:\u002F\u002Fopenbench.dev\u002F).\n\n## Features\n\n- **🎯 95+ Benchmarks**: MMLU, GPQA, HumanEval, SimpleQA, competition math (AIME, HMMT), SciCode, GraphWalks, and more\n- **🔧 Simple CLI**: `bench list`, `bench describe`, `bench eval` (also available as `openbench`), `-M`\u002F`-T` flags for model\u002Ftask args, `--debug` mode for eval-retry, experimental benchmarks with `--alpha` flag\n- **🏗️ Built on inspect-ai**: Industry-standard evaluation framework\n- **📊 Extensible**: Easy to add new benchmarks and metrics\n- **🤖 Provider-agnostic**: Works with 30+ model providers out of the box\n- **🛠️ Local Eval Support**: Privatized benchmarks can be run with `bench eval \u003Cpath>`\n- **📤 Hugging Face Integration**: Push evaluation results directly to Hugging Face datasets\n\n## 🏃 Speedrun: Evaluate a Model in 60 Seconds\n\n**Prerequisite**: [Install uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002Fgetting-started\u002Finstallation\u002F)\n\n```bash\n# Create a virtual environment and install openbench (30 seconds)\nuv venv\nsource .venv\u002Fbin\u002Factivate\nuv pip install openbench\n\n# Set your API key (any provider!)\nexport GROQ_API_KEY=your_key  # or OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.\n\n# Run your first eval (3 seconds)\nbench eval mmlu --model groq\u002Fopenai\u002Fgpt-oss-120b --limit 10\n\n# That's it! 🎉 Check results in .\u002Flogs\u002F or view them in an interactive UI:\nbench view\n```\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fe99e4628-f1f5-48e4-9df2-ae28b86168c2\n\n## Supported Providers\n\nopenbench supports 30+ model providers through Inspect AI. 
Set the appropriate API key environment variable and you're ready to go:\n\n| Provider              | Environment Variable   | Example Model String             |\n| --------------------- | ---------------------- | -------------------------------- |\n| **AI21 Labs**         | `AI21_API_KEY`         | `ai21\u002Fmodel-name`                |\n| **Anthropic**         | `ANTHROPIC_API_KEY`    | `anthropic\u002Fmodel-name`           |\n| **AWS Bedrock**       | AWS credentials        | `bedrock\u002Fmodel-name`             |\n| **Azure**             | `AZURE_OPENAI_API_KEY` | `azure\u002F\u003Cdeployment-name>`        |\n| **Baseten**           | `BASETEN_API_KEY`      | `baseten\u002Fmodel-name`             |\n| **Cerebras**          | `CEREBRAS_API_KEY`     | `cerebras\u002Fmodel-name`            |\n| **Cohere**            | `COHERE_API_KEY`       | `cohere\u002Fmodel-name`              |\n| **Crusoe**            | `CRUSOE_API_KEY`       | `crusoe\u002Fmodel-name`              |\n| **DeepInfra**         | `DEEPINFRA_API_KEY`    | `deepinfra\u002Fmodel-name`           |\n| **Friendli**          | `FRIENDLI_TOKEN`       | `friendli\u002Fmodel-name`            |\n| **Google**            | `GOOGLE_API_KEY`       | `google\u002Fmodel-name`              |\n| **Groq**              | `GROQ_API_KEY`         | `groq\u002Fmodel-name`                |\n| **Helicone**          | `HELICONE_API_KEY`     | `helicone\u002Fmodel-name`            |\n| **Hugging Face**      | `HF_TOKEN`             | `huggingface\u002Fmodel-name`         |\n| **Hyperbolic**        | `HYPERBOLIC_API_KEY`   | `hyperbolic\u002Fmodel-name`          |\n| **Lambda**            | `LAMBDA_API_KEY`       | `lambda\u002Fmodel-name`              |\n| **MiniMax**           | `MINIMAX_API_KEY`      | `minimax\u002Fmodel-name`             |\n| **Mistral**           | `MISTRAL_API_KEY`      | `mistral\u002Fmodel-name`             |\n| **Moonshot**          | `MOONSHOT_API_KEY`     | `moonshot\u002Fmodel-name`            
|\n| **Nebius**            | `NEBIUS_API_KEY`       | `nebius\u002Fmodel-name`              |\n| **Nous Research**     | `NOUS_API_KEY`         | `nous\u002Fmodel-name`                |\n| **Novita AI**         | `NOVITA_API_KEY`       | `novita\u002Fmodel-name`              |\n| **Ollama**            | None (local)           | `ollama\u002Fmodel-name`              |\n| **OpenAI**            | `OPENAI_API_KEY`       | `openai\u002Fmodel-name`              |\n| **OpenRouter**        | `OPENROUTER_API_KEY`   | `openrouter\u002Fmodel-name`          |\n| **Parasail**          | `PARASAIL_API_KEY`     | `parasail\u002Fmodel-name`            |\n| **Perplexity**        | `PERPLEXITY_API_KEY`   | `perplexity\u002Fmodel-name`          |\n| **Reka**              | `REKA_API_KEY`         | `reka\u002Fmodel-name`                |\n| **SambaNova**         | `SAMBANOVA_API_KEY`    | `sambanova\u002Fmodel-name`           |\n| **SiliconFlow**       | `SILICONFLOW_API_KEY`  | `siliconflow\u002Fmodel-name`         |\n| **Together AI**       | `TOGETHER_API_KEY`     | `together\u002Fmodel-name`            |\n| **Vercel AI Gateway** | `AI_GATEWAY_API_KEY`   | `vercel\u002Fcreator-name\u002Fmodel-name` |\n| **W&B Inference**     | `WANDB_API_KEY`        | `wandb\u002Fmodel-name`               |\n| **vLLM**              | None (local)           | `vllm\u002Fmodel-name`                |\n\n## Available Benchmarks\n\nSee the [Benchmarks Catalog](https:\u002F\u002Fopenbench.dev\u002Fbenchmarks\u002Fcatalog) or use `bench list`.\n\n\n## Commands and Options\n\nFor a complete list of all commands and options, run: `bench --help`\nSee the [docs](https:\u002F\u002Fopenbench.dev\u002Fcli\u002Foverview) for more details. 
\n\n| Command                  | Description                                        |\n| ------------------------ | -------------------------------------------------- |\n| `bench list`             | List available benchmarks                          |\n| `bench eval \u003Cbenchmark>` | Run benchmark evaluation                           |\n| `bench eval-retry \u003Clog_files>`  | Retry a failed evaluation                          |\n| `bench view`             | Interactive UI to view benchmark logs             |\n| `bench cache \u003Cinfo\u002Fls\u002Fclear\u002Fupload>`            | Manage OpenBench caches            |\n\n\n### Common `eval` Configuration Options\n| Option               | Environment Variable      | Default                   | Description                                                      |\n| -------------------- |---------------------------|---------------------------|------------------------------------------------------------------|\n| `-M \u003Cargs>`          | `None`                      | `None`                      | Pass provider\u002Fmodel-specific arguments (e.g., `-M only=groq`)    |\n| `-T \u003Cargs>`          | `None`                      | `None`                      | Pass task-specific arguments to the benchmark                    |\n| `--model`            | `BENCH_MODEL`             | `groq\u002Fopenai\u002Fgpt-oss-20b` | Model(s) to evaluate                                             |\n| `--epochs`           | `BENCH_EPOCHS`            | `1`                       | Number of epochs to run each evaluation                          |\n| `--epochs-reducer`   | `BENCH_EPOCHS_REDUCER`    | `None`                      | Reducer(s) applied when aggregating epoch scores                |\n| `--max-connections`  | `BENCH_MAX_CONNECTIONS`   | `10`                      | Maximum parallel requests to model                               |\n| `--temperature`      | `BENCH_TEMPERATURE`       | `0.6`                     | Model temperature   
                                             |\n| `--top-p`            | `BENCH_TOP_P`             | `1.0`                     | Model top-p                                                      |\n| `--max-tokens`       | `BENCH_MAX_TOKENS`        | `None`                    | Maximum tokens for model response                                |\n| `--seed`             | `BENCH_SEED`              | `None`                    | Seed for deterministic generation                                |\n| `--limit`            | `BENCH_LIMIT`             | `None`                    | Limit evaluated samples (number or start,end)                    |\n| `--logfile`          | `BENCH_OUTPUT`            | `None`                    | Output file for results                                          |\n| `--sandbox`          | `BENCH_SANDBOX`           | `None`                    | Environment to run evaluation (local\u002Fdocker)                     |\n| `--timeout`          | `BENCH_TIMEOUT`           | `10000`                   | Timeout for each API request (seconds)                           |\n| `--fail-on-error`          | `None`           | `1`                   | Threshold of allowable sample errors (use an integer for count or a float for proportion) |\n| `--display`          | `BENCH_DISPLAY`           | `None`                    | Display type (full\u002Fconversation\u002Frich\u002Fplain\u002Fnone)                 |\n| `--reasoning-effort` | `BENCH_REASONING_EFFORT`  | `None`                    | Reasoning effort level (low\u002Fmedium\u002Fhigh)                         |\n| `--json`             | `None`                      | `False`                   | Output results in JSON format                                    |\n| `--log-format`       | `BENCH_LOG_FORMAT`        | `eval`                    | Output logging format (eval\u002Fjson)                                |\n| `--hub-repo`         | `BENCH_HUB_REPO`          | `None`                    | Push results to a 
Hugging Face Hub dataset                       |\n| `--keep-livemcp-root` | `BENCH_KEEP_LIVEMCP_ROOT` | `False`                   | Allow preservation of root data after livemcpbench eval runs     |\n| `--code-agent`       | `BENCH_CODE_AGENT`        | `codex`                   | Select code agent for Exercism tasks (codex\u002Faider\u002Fopencode\u002Fclaude_code\u002Froo) |\n| `--hidden-tests`     | `BENCH_HIDDEN_TESTS`      | `False`                   | Run Exercism agents with hidden tests |\n\n\n## Development and Building Your Own Evals\n\nFor a full guide, see [Contributing Guidelines](CONTRIBUTING.md) and [Extending openbench](https:\u002F\u002Fopenbench.dev\u002Fdevelopment\u002Fextending). Also, check out Inspect AI's excellent [documentation](https:\u002F\u002Finspect.aisi.org.uk\u002F).\n\n### Quick Eval: Run from Path\n\nFor one-off or private evaluations, point openbench directly at your eval:\n\n```bash\nbench eval \u002Fpath\u002Fto\u002Fmy_eval.py --model groq\u002Fllama-3.3-70b-versatile\n```\n\n### Plugin System: Distribute as Packages\n\nopenbench supports a **plugin system via Python entry points**. Package your benchmarks and distribute them independently:\n\n```toml\n# pyproject.toml\n[project.entry-points.\"openbench.benchmarks\"]\nmy_benchmark = \"my_pkg.metadata:get_benchmark_metadata\"\n```\n\nAfter `pip install my-benchmark-package`, your benchmark appears in `bench list` and works with all CLI commands. 
Perfect for:\n- Sharing benchmarks across teams\n- Versioning evaluations independently\n- Overriding built-in benchmarks with custom implementations\n\n## FAQ\n\n### How does openbench differ from Inspect AI?\n\nopenbench provides:\n\n- **Reference implementations** of 20+ major benchmarks with consistent interfaces\n- **Shared utilities** for common patterns (math scoring, multi-language support, etc.)\n- **Curated scorers** that work across different eval types\n- **CLI tooling** optimized for running standardized benchmarks\n\nThink of it as a benchmark library built on Inspect's excellent foundation.\n\n### Why not just use Inspect AI, lm-evaluation-harness, or lighteval?\n\nDifferent tools for different needs! openbench focuses on:\n\n- **Shared components**: Common scorers, solvers, and datasets across benchmarks reduce code duplication\n- **Clean implementations**: Each eval is written for readability and reliability\n- **Developer experience**: Simple CLI, consistent patterns, easy to extend\n\nWe built openbench because we needed evaluation code that was easy to understand, modify, and trust. It's a curated set of benchmarks built on Inspect AI's excellent foundation.\n\n### How can I run `bench` outside of the `uv` environment?\n\nIf you want `bench` to be available outside of `uv`, you can run the following command:\n\n```bash\nuv run pip install -e .\n```\n\n### I'm running into an issue when downloading a dataset from HuggingFace - how do I fix it?\n\nSome evaluations may require logging into HuggingFace to download the dataset. If `bench` prompts you to do so, or throws \"gated\" errors,\ndefining the environment variable\n\n```bash\nHF_TOKEN=\"\u003CHUGGINGFACE_TOKEN>\"\n```\n\nshould fix the issue. 
See the [Hugging Face authentication documentation](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Fen\u002Fdatasets-polars-auth) for details.\n\nSee the docs for further [Tips and Troubleshooting](https:\u002F\u002Fopenbench.dev\u002Ftroubleshooting).\n\n## 🚧 Alpha Release\n\nWe're building in public! This is an alpha release - expect rapid iteration. The first stable release is coming soon.\n\nQuick links:\n\n- [Report a bug](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002Fnew?assignees=&labels=bug&projects=&template=bug_report.yml)\n- [Request a feature](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002Fnew?assignees=&labels=enhancement&projects=&template=feature_request.yml)\n\n## Reproducibility Statement\n\nAs the authors of openbench, we strive to implement this tool's evaluations as faithfully as possible with respect to the original benchmarks themselves.\n\nHowever, it is expected that developers may observe numerical discrepancies between openbench's scores and the reported scores from other sources.\n\nThese numerical differences can be attributed to many reasons, including (but not limited to) minor variations in the model prompts, different model quantization or inference approaches, and repurposing benchmarks to be compatible with the packages used to develop openbench.\n\nAs a result, openbench results are meant to be compared with openbench results, not as a universal one-to-one comparison with every external result. 
For meaningful comparisons, ensure you are using the same version of openbench.\n\nWe encourage developers to identify areas of improvement and we welcome open source contributions to openbench.\n\n## Acknowledgments\n\nThis project would not be possible without:\n\n- **[Inspect AI](https:\u002F\u002Fgithub.com\u002FUKGovernmentBEIS\u002Finspect_ai)** - The incredible evaluation framework that powers openbench\n- **[EleutherAI's lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness)** - Pioneering work in standardized LLM evaluation\n- **[Hugging Face's lighteval](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flighteval)** - Excellent evaluation infrastructure\n\n## Citation\n\n```bibtex\n@software{openbench,\n  title = {openbench: Provider-agnostic, open-source evaluation infrastructure for language models},\n  author = {Sah, Aarush},\n  year = {2025},\n  url = {https:\u002F\u002Fopenbench.dev}\n}\n```\n\n## License\n\nMIT\n\n---\n\nBuilt with ❤️ by [Aarush Sah](https:\u002F\u002Fgithub.com\u002FAarushSah) and the [Groq](https:\u002F\u002Fgroq.com) team\n","# openbench\n\n\u003Ch2 align=\"center\">\n \u003Cbr>\n \u003Cimg src=\"docs\u002Flogo\u002Fdark.svg\" alt=\"openbench\" width=\"250\">\n \u003Cbr>\n \u003Cbr>\n与提供商无关的开源语言模型评估基础设施\n \u003Cbr>\n\u003C\u002Fh2>\n\n\u003Cp align=\"center\">\n \u003Ca href=\"https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fopenbench\">\u003Cimg src=\"https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fopenbench.svg\">\n \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fblob\u002Fmain\u002FLICENSE.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg\">\n \u003Ca href=\"https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10+-blue.svg\">\n \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fstargazers\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgroq\u002Fopenbench\">\u003C\u002Fa>\n \u003C\u002Fa>\n\u003C\u002Fp>\n\n\nopenbench 为 LLM 提供标准化、可复现的基准测试，覆盖 30 多个评估套件（且仍在增加），涵盖知识、数学、推理、编程、科学、阅读理解、健康、长上下文回忆、图推理等领域，并且原生支持您自己的本地评估，以保护隐私。**适用于任何模型提供商**——Groq、OpenAI、Anthropic、Cohere、Google、AWS Bedrock、Azure、通过 Ollama 使用的本地模型、Hugging Face 等 30 多家提供商。\n\n要开始使用，请参阅下方教程或参考 [文档](https:\u002F\u002Fopenbench.dev\u002F)。\n\n## 特性\n\n- **🎯 95+ 基准测试**：MMLU、GPQA、HumanEval、SimpleQA、竞赛数学（AIME、HMMT）、SciCode、GraphWalks 等\n- **🔧 简单的命令行界面**：`bench list`、`bench describe`、`bench eval`（也可作为 `openbench` 使用），带有 `-M`\u002F`-T` 标志用于指定模型和任务参数，`--debug` 模式用于重试评估，实验性基准测试可通过 `--alpha` 标志启用\n- **🏗️ 构建于 inspect-ai 之上**：行业标准的评估框架\n- **📊 可扩展性**：易于添加新的基准测试和指标\n- **🤖 与提供商无关**：开箱即用，支持 30 多家模型提供商\n- **🛠️ 本地评估支持**：可通过 `bench eval \u003Cpath>` 运行私有化基准测试\n- **📤 Hugging Face 集成**：可将评估结果直接推送到 Hugging Face 数据集\n\n## 🏃 速成：60 秒内评估一个模型\n\n**前提条件**：[安装 uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002Fgetting-started\u002Finstallation\u002F)\n\n```bash\n# 创建虚拟环境并安装 openbench（30 秒）\nuv venv\nsource .venv\u002Fbin\u002Factivate\nuv pip install openbench\n\n# 设置您的 API 密钥（任何提供商！）\nexport GROQ_API_KEY=your_key  # 或 OPENAI_API_KEY、ANTHROPIC_API_KEY 等\n\n# 运行您的第一个评估（3 秒）\nbench eval mmlu --model groq\u002Fopenai\u002Fgpt-oss-120b --limit 10\n\n# 就这样！🎉 您可以在 .\u002Flogs\u002F 中查看结果，或通过交互式 UI 查看：\nbench view\n```\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fe99e4628-f1f5-48e4-9df2-ae28b86168c2\n\n## 支持的提供商\n\nopenbench 通过 Inspect AI 支持 30 多家模型提供商。只需设置相应的 API 密钥环境变量，即可开始使用：\n\n| 提供商              | 环境变量   | 示例模型字符串             |\n| --------------------- | ---------------------- | -------------------------------- |\n| **AI21 Labs**         | `AI21_API_KEY`         | `ai21\u002Fmodel-name`                |\n| **Anthropic**         | `ANTHROPIC_API_KEY`    | `anthropic\u002Fmodel-name`           |\n| **AWS Bedrock**       | AWS 凭证        | `bedrock\u002Fmodel-name`             |\n| 
**Azure**             | `AZURE_OPENAI_API_KEY` | `azure\u002F\u003Cdeployment-name>`        |\n| **Baseten**           | `BASETEN_API_KEY`      | `baseten\u002Fmodel-name`             |\n| **Cerebras**          | `CEREBRAS_API_KEY`     | `cerebras\u002Fmodel-name`            |\n| **Cohere**            | `COHERE_API_KEY`       | `cohere\u002Fmodel-name`              |\n| **Crusoe**            | `CRUSOE_API_KEY`       | `crusoe\u002Fmodel-name`              |\n| **DeepInfra**         | `DEEPINFRA_API_KEY`    | `deepinfra\u002Fmodel-name`           |\n| **Friendli**          | `FRIENDLI_TOKEN`       | `friendli\u002Fmodel-name`            |\n| **Google**            | `GOOGLE_API_KEY`       | `google\u002Fmodel-name`              |\n| **Groq**              | `GROQ_API_KEY`         | `groq\u002Fmodel-name`                |\n| **Helicone**          | `HELICONE_API_KEY`     | `helicone\u002Fmodel-name`            |\n| **Hugging Face**      | `HF_TOKEN`             | `huggingface\u002Fmodel-name`         |\n| **Hyperbolic**        | `HYPERBOLIC_API_KEY`   | `hyperbolic\u002Fmodel-name`          |\n| **Lambda**            | `LAMBDA_API_KEY`       | `lambda\u002Fmodel-name`              |\n| **MiniMax**           | `MINIMAX_API_KEY`      | `minimax\u002Fmodel-name`             |\n| **Mistral**           | `MISTRAL_API_KEY`      | `mistral\u002Fmodel-name`             |\n| **Moonshot**          | `MOONSHOT_API_KEY`     | `moonshot\u002Fmodel-name`            |\n| **Nebius**            | `NEBIUS_API_KEY`       | `nebius\u002Fmodel-name`              |\n| **Nous Research**     | `NOUS_API_KEY`         | `nous\u002Fmodel-name`                |\n| **Novita AI**         | `NOVITA_API_KEY`       | `novita\u002Fmodel-name`              |\n| **Ollama**            | 无（本地）           | `ollama\u002Fmodel-name`              |\n| **OpenAI**            | `OPENAI_API_KEY`       | `openai\u002Fmodel-name`              |\n| **OpenRouter**        | `OPENROUTER_API_KEY`   | 
`openrouter\u002Fmodel-name`          |\n| **Parasail**          | `PARASAIL_API_KEY`     | `parasail\u002Fmodel-name`            |\n| **Perplexity**        | `PERPLEXITY_API_KEY`   | `perplexity\u002Fmodel-name`          |\n| **Reka**              | `REKA_API_KEY`         | `reka\u002Fmodel-name`                |\n| **SambaNova**         | `SAMBANOVA_API_KEY`    | `sambanova\u002Fmodel-name`           |\n| **SiliconFlow**       | `SILICONFLOW_API_KEY`  | `siliconflow\u002Fmodel-name`         |\n| **Together AI**       | `TOGETHER_API_KEY`     | `together\u002Fmodel-name`            |\n| **Vercel AI Gateway** | `AI_GATEWAY_API_KEY`   | `vercel\u002Fcreator-name\u002Fmodel-name` |\n| **W&B Inference**     | `WANDB_API_KEY`        | `wandb\u002Fmodel-name`               |\n| **vLLM**              | 无（本地）           | `vllm\u002Fmodel-name`                |\n\n## 可用的基准测试\n\n请参阅 [基准测试目录](https:\u002F\u002Fopenbench.dev\u002Fbenchmarks\u002Fcatalog) 或使用 `bench list`。\n\n\n## 命令与选项\n\n如需查看所有命令和选项的完整列表，请运行：`bench --help`\n更多详细信息请参阅 [文档](https:\u002F\u002Fopenbench.dev\u002Fcli\u002Foverview)。\n\n| 命令                  | 描述                                        |\n| ------------------------ | -------------------------------------------------- |\n| `bench list`             | 列出可用的基准测试                          |\n| `bench eval \u003Cbenchmark>` | 运行基准测试评估                           |\n| `bench eval-retry \u003Clog_files>`  | 重试失败的评估                          |\n| `bench view`             | 交互式 UI 用于查看基准测试日志             |\n| `bench cache \u003Cinfo\u002Fls\u002Fclear\u002Fupload>`            | 管理 OpenBench 缓存            |\n\n### 常用 `eval` 配置选项\n| 选项               | 环境变量      | 默认值                   | 描述                                                      |\n| -------------------- |---------------------------|---------------------------|------------------------------------------------------------------|\n| `-M \u003Cargs>`          | `无`                      | `无`                
      | 传递提供者\u002F模型特定的参数（例如 `-M only=groq`）    |\n| `-T \u003Cargs>`          | `无`                      | `无`                      | 将任务特定的参数传递给基准测试                    |\n| `--model`            | `BENCH_MODEL`             | `groq\u002Fopenai\u002Fgpt-oss-20b` | 要评估的模型                                             |\n| `--epochs`           | `BENCH_EPOCHS`            | `1`                       | 每次评估运行的轮数                                          |\n| `--epochs-reducer`   | `BENCH_EPOCHS_REDUCER`    | `无`                      | 在聚合每轮得分时应用的规约函数                |\n| `--max-connections`  | `BENCH_MAX_CONNECTIONS`   | `10`                      | 对模型的最大并行请求数                               |\n| `--temperature`      | `BENCH_TEMPERATURE`       | `0.6`                     | 模型温度                                                |\n| `--top-p`            | `BENCH_TOP_P`             | `1.0`                     | 模型 top-p                                              |\n| `--max-tokens`       | `BENCH_MAX_TOKENS`        | `无`                    | 模型响应的最大 token 数量                                |\n| `--seed`             | `BENCH_SEED`              | `无`                    | 用于确定性生成的随机种子                                 |\n| `--limit`            | `BENCH_LIMIT`             | `无`                    | 限制评估样本的数量或指定起止范围                          |\n| `--logfile`          | `BENCH_OUTPUT`            | `无`                    | 结果输出文件                                          |\n| `--sandbox`          | `BENCH_SANDBOX`           | `无`                    | 运行评估的环境（本地\u002FDocker）                     |\n| `--timeout`          | `BENCH_TIMEOUT`           | `10000`                   | 每个 API 请求的超时时间（秒）                           |\n| `--fail-on-error`          | `无`           | `1`                   | 允许的样本错误阈值（使用整数表示数量，或浮点数表示比例） |\n| `--display`          | `BENCH_DISPLAY`           | `无`                    | 显示类型（完整\u002F对话\u002F丰富\u002F简单\u002F无）                 |\n| `--reasoning-effort` | 
`BENCH_REASONING_EFFORT`  | `无`                    | 推理努力程度（低\u002F中\u002F高）                         |\n| `--json`             | `无`                      | `False`                   | 以 JSON 格式输出结果                                    |\n| `--log-format`       | `BENCH_LOG_FORMAT`        | `eval`                    | 输出日志格式（eval\u002Fjson）                                |\n| `--hub-repo`         | `BENCH_HUB_REPO`          | `无`                    | 将结果推送到 Hugging Face Hub 数据集                       |\n| `--keep-livemcp-root` | `BENCH_KEEP_LIVEMCP_ROOT` | `False`                   | 允许在 livemcpbench 评估完成后保留根数据     |\n| `--code-agent`       | `BENCH_CODE_AGENT`        | `codex`                   | 为 Exercism 任务选择代码代理（codex\u002Faider\u002Fopencode\u002Fclaude_code\u002Froo） |\n| `--hidden-tests`     | `BENCH_HIDDEN_TESTS`      | `False`                   | 使用隐藏测试运行 Exercism 代理 |\n\n\n## 开发与构建自定义评估\n\n有关完整指南，请参阅 [贡献指南](CONTRIBUTING.md) 和 [扩展 openbench](https:\u002F\u002Fopenbench.dev\u002Fdevelopment\u002Fextending)。此外，还可查看 Inspect AI 的优秀 [文档](https:\u002F\u002Finspect.aisi.org.uk\u002F)。\n\n### 快速评估：从路径运行\n\n对于一次性或私有评估，可直接将 openbench 指向您的评估脚本：\n\n```bash\nbench eval \u002Fpath\u002Fto\u002Fmy_eval.py --model groq\u002Fllama-3.3-70b-versatile\n```\n\n### 插件系统：以包形式分发\n\nopenbench 支持通过 Python 入口点实现的插件系统。您可以将自己的基准测试打包并独立分发：\n\n```toml\n# pyproject.toml\n[project.entry-points.\"openbench.benchmarks\"]\nmy_benchmark = \"my_pkg.metadata:get_benchmark_metadata\"\n```\n\n在执行 `pip install my-benchmark-package` 后，您的基准测试将出现在 `bench list` 中，并可与所有 CLI 命令配合使用。这非常适合：\n- 在团队间共享基准测试\n- 独立管理评估版本\n- 用自定义实现覆盖内置基准测试\n\n## 常见问题解答\n\n### openbench 与 Inspect AI 有何不同？\n\nopenbench 提供：\n- 20 多种主要基准测试的参考实现，具有统一的接口\n- 用于常见模式的共享工具（数学评分、多语言支持等）\n- 经过精心挑选的评分器，可在不同类型的评估中通用\n- 针对运行标准化基准测试优化的 CLI 工具\n\n可以将其视为建立在 Inspect 出色基础之上的一套基准测试库。\n\n### 为什么不直接使用 Inspect AI、lm-evaluation-harness 或 lighteval？\n\n不同的工具有不同的用途！openbench 专注于：\n- 共享组件：跨基准测试的通用评分器、求解器和数据集，减少代码重复\n- 清晰的实现：每个评估都注重可读性和可靠性\n- 开发者体验：简洁的 
CLI、一致的模式，易于扩展\n\n我们构建 openbench 是因为我们需要易于理解、修改且值得信赖的评估代码。它是一套基于 Inspect AI 优秀基础的精选基准测试。\n\n### 如何在 `uv` 环境之外运行 `bench`？\n\n如果您希望 `bench` 在 `uv` 环境之外也可用，可以运行以下命令：\n\n```bash\nuv run pip install -e .\n```\n\n### 我在从 HuggingFace 下载数据集时遇到问题，该如何解决？\n\n某些评估可能需要登录 HuggingFace 才能下载数据集。如果 `bench` 提示您进行登录，或抛出“受控”错误，\n设置环境变量\n\n```bash\nHF_TOKEN=\"\u003CHUGGINGFACE_TOKEN>\"\n```\n\n应该可以解决问题。完整的 HuggingFace 文档可以在 [HuggingFace 的身份验证文档](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Fen\u002Fdatasets-polars-auth) 中找到。\n\n更多提示和故障排除信息，请参阅 [故障排除文档](https:\u002F\u002Fopenbench.dev\u002Ftroubleshooting)。\n\n## 🚧 Alpha 版本\n\n我们正在公开开发！这是一个 Alpha 版本，预计会快速迭代。首个稳定版本即将发布。\n\n快捷链接：\n\n- [报告 bug](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002Fnew?assignees=&labels=bug&projects=&template=bug_report.yml)\n- [请求功能](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002Fnew?assignees=&labels=enhancement&projects=&template=feature_request.yml)\n\n## 可复现性声明\n\n作为 openbench 的作者，我们致力于尽可能忠实地实现该工具的评估，使其与原始基准测试本身保持一致。\n\n然而，开发者可能会观察到 openbench 的得分与其他来源报告的得分之间存在数值差异。\n\n这些数值差异可能由多种原因造成，包括但不限于模型提示的细微变化、不同的模型量化或推理方法，以及为使基准测试与开发 openbench 所使用的软件包兼容而进行的调整。\n\n因此，openbench 的结果应仅与 openbench 的其他结果进行比较，而不应被视为可与所有外部结果一一对应的通用标准。为了进行有意义的比较，请确保使用相同版本的 openbench。\n\n我们鼓励开发者识别改进空间，并欢迎对 openbench 进行开源贡献。\n\n## 致谢\n\n本项目离不开以下支持：\n\n- **[Inspect AI](https:\u002F\u002Fgithub.com\u002FUKGovernmentBEIS\u002Finspect_ai)** - 提供 openbench 核心功能的卓越评估框架\n- **[EleutherAI 的 lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness)** - 在标准化大语言模型评估领域开创性的工作\n- **[Hugging Face 的 lighteval](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flighteval)** - 优秀的评估基础设施\n\n## 引用\n\n```bibtex\n@software{openbench,\n  title = {openbench：面向语言模型的提供商无关、开源评估基础设施},\n  author = {Sah, Aarush},\n  year = {2025},\n  url = {https:\u002F\u002Fopenbench.dev}\n}\n```\n\n## 许可证\n\nMIT\n\n---\n\n由 [Aarush Sah](https:\u002F\u002Fgithub.com\u002FAarushSah) 和 
[Groq](https:\u002F\u002Fgroq.com) 团队用心打造","# OpenBench 快速上手指南\n\nOpenBench 是一个与提供商无关的开源大语言模型（LLM）评估基础设施。它基于 `inspect-ai` 构建，支持 30+ 主流模型提供商（如 Groq, OpenAI, Anthropic, 本地 Ollama\u002FvLLM 等），提供标准化、可复现的基准测试能力，涵盖知识、数学、推理、编程等多个领域。\n\n## 环境准备\n\n*   **操作系统**：Linux, macOS, Windows (WSL 推荐)\n*   **Python 版本**：3.10 或更高\n*   **前置依赖**：推荐使用 [`uv`](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) 进行极速环境管理（也可使用 pip）。\n    *   安装 uv (官方脚本):\n        ```bash\n        curl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n        ```\n    *   *国内加速提示*：如果下载缓慢，可尝试配置国内镜像源或在后续 pip 安装步骤中指定 `-i` 参数。\n\n## 安装步骤\n\n使用 `uv` 创建虚拟环境并安装 OpenBench（全程约 30 秒）：\n\n```bash\n# 1. 创建虚拟环境\nuv venv\n\n# 2. 激活环境\n# Linux\u002FmacOS:\nsource .venv\u002Fbin\u002Factivate\n# Windows (PowerShell):\n.venv\\Scripts\\Activate.ps1\n\n# 3. 安装 openbench\nuv pip install openbench\n```\n\n> **注**：若使用 `pip` 安装且需国内加速，可使用：\n> `pip install openbench -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n\n## 基本使用\n\n### 1. 配置 API Key\n设置你所使用的模型提供商的 API Key 环境变量。OpenBench 支持几乎所有主流提供商。\n\n```bash\n# 示例：配置 Groq (或其他如 OPENAI_API_KEY, ANTHROPIC_API_KEY 等)\nexport GROQ_API_KEY=your_api_key_here\n```\n\n### 2. 运行首次评估\n使用 `bench eval` 命令运行基准测试。以下示例在 MMLU 数据集上限制只跑 10 个样本进行快速测试：\n\n```bash\nbench eval mmlu --model groq\u002Fllama-3.3-70b-versatile --limit 10\n```\n\n*   `mmlu`: 基准测试名称（可用 `bench list` 查看所有支持的测试集）。\n*   `--model`: 模型标识符，格式为 `提供商\u002F模型名`。\n*   `--limit`: 限制评估样本数量，适合快速验证。\n\n### 3. 
查看结果\n评估完成后，可以通过交互式 UI 查看详细日志和结果：\n\n```bash\nbench view\n```\n\n结果文件默认保存在当前目录的 `.\u002Flogs\u002F` 文件夹中。\n\n---\n\n### 常用命令速查\n\n| 命令 | 说明 |\n| :--- | :--- |\n| `bench list` | 列出所有可用的基准测试套件 |\n| `bench eval \u003Cname>` | 运行指定的基准测试 |\n| `bench view` | 启动交互式界面查看评估日志 |\n| `bench --help` | 查看完整命令选项帮助 |\n\n### 进阶提示\n*   **本地模型**：支持直接评估本地运行的模型，例如：`--model ollama\u002Fllama3` 或 `--model vllm\u002Fmeta-llama\u002FLlama-3-8b`.\n*   **自定义评估**：支持运行本地编写的评估脚本：`bench eval \u002Fpath\u002Fto\u002Fmy_eval.py --model ...`\n*   **重试失败任务**：`bench eval-retry \u003Clog_files>`","某 AI 初创团队需要在发布新版本的医疗问答模型前，快速对比 Groq、Anthropic 和本地部署模型在专业医学知识（MedQA）及长文本病历分析上的表现，以决定最终上线方案。\n\n### 没有 openbench 时\n- **评估框架割裂**：针对不同厂商（如 OpenAI 与本地 Ollama）需编写多套独立的测试脚本，代码重复率高且难以维护。\n- **基准测试不统一**：缺乏标准化的医学与长上下文数据集，团队需手动收集题目并自行设计评分逻辑，结果缺乏行业公信力。\n- **切换成本高昂**：每次更换被测模型都需要修改大量底层 API 调用代码，无法实现“一键切换”进行横向对比。\n- **隐私与合规风险**：敏感的医疗测试数据若通过非受控的第三方平台评估，存在数据泄露隐患，且难以在本地封闭环境中运行。\n\n### 使用 openbench 后\n- **统一评估基础设施**：利用其 Provider-agnostic 特性，仅需一条 `bench eval` 命令即可在同一框架下无缝切换并评估 Groq、Anthropic 及本地模型。\n- **内置权威医疗基准**：直接调用内置的医疗健康与长上下文召回率等 95+ 标准化评测集，无需自建题库，确保结果可复现且具可比性。\n- **极速横向对比**：通过简单的 CLI 参数（如 `-M` 指定模型），几分钟内即可完成多模型在同一任务上的性能跑分，大幅缩短决策周期。\n- **安全本地化评估**：支持通过 `bench eval \u003Cpath>` 在本地私有环境运行自定义医疗评测，确保敏感病历数据不出内网，满足合规要求。\n\nopenbench 让团队从繁琐的评估基建中解放出来，将精力聚焦于模型优化，实现了跨厂商、高标准且安全的模型能力量化决策。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgroq_openbench_b22f64aa.png","groq","Groq","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fgroq_aa5c3e72.png","Groq's primary GitHub organization for internal software projects.",null,"contact@groq.com","GroqInc","https:\u002F\u002Fgroq.com","https:\u002F\u002Fgithub.com\u002Fgroq",[83,87,91],{"name":84,"color":85,"percentage":86},"Python","#3572A5",99.8,{"name":88,"color":89,"percentage":90},"Dockerfile","#384d54",0.1,{"name":92,"color":93,"percentage":90},"Shell","#89e051",758,98,"2026-04-09T11:33:47","MIT","Linux, macOS, Windows","非必需。支持本地模型（Ollama, vLLM）时取决于具体模型需求；云端 API 调用无需本地 
GPU。","未说明",{"notes":102,"python":103,"dependencies":104},"该工具主要作为评估基础设施，通过 API 连接 30+ 云服务商（如 OpenAI, Groq, Anthropic 等），因此通常不需要高性能本地硬件。若运行本地模型（需安装 Ollama 或 vLLM），硬件需求取决于所选模型大小。推荐使用 'uv' 工具管理虚拟环境和安装依赖。","3.10+",[105,106],"uv","inspect-ai",[35,14,108],"其他",[110],"managed-by-terraform","2026-03-27T02:49:30.150509","2026-04-10T04:37:27.017530",[114,119,124,129,134,139],{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},27112,"运行评估时遇到 'Unable to initialise OpenAI client' 错误，即使我使用的是 Bedrock 模型并配置了 AWS 凭证怎么办？","这通常是因为您选择的基准测试（如 healthbench）需要基于 GPT-4.1 的模型评分，因此必须配置 OpenAI API 密钥。即使主模型是 Bedrock，评分步骤仍需访问 OpenAI。请确保在环境变量中设置 OPENAI_API_KEY。维护者表示将优化此错误提示以明确说明需求。","https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F109",{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},27113,"PyPI 安装的 openbench 缺少核心代码文件导致无法使用，如何解决？","该问题已在 v0.1.1 版本中修复。如果您遇到此问题，请将 openbench 升级到最新版本：pip install --upgrade openbench。旧版本（如 0.1.0）的打包配置有误，遗漏了 src\u002Fopenbench 下的核心目录。","https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F9",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},27114,"如何获取 JSON 格式的输出结果？README 中提到但似乎不支持 --json 参数。","由于 --json 参数在内部已被占用，建议使用 --log-format 标志来指定输出格式。例如，尝试使用类似 --log-format json 的参数（具体取决于版本实现），这可以替代直接使用的 --json 标志以获得结构化日志输出。","https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F15",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},27115,"在使用 gpt-5-x 系列模型时，设置 temperature 为 0.6 报错，只支持默认值 1 吗？","这是一个已知问题，特定于 gpt-5-x 模型的 temperature 参数兼容性。该问题已在 PR #72 中修复。请确保您将 openbench 更新到包含此修复的最新版本，之后即可正常设置非默认的 temperature 值。","https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F50",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},27116,"MCQ（多项选择题）评分器是否支持 5 个选项（A-E）的题目？","早期版本的正则表达式仅支持 A-D 四个选项。针对包含 5 个选项的基准测试（如 CommonsenseQA），维护者已在 PR #181 中进行了修复工作。请升级到修复后的版本以支持 E 
选项的自动识别和评分。","https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F162",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},27117,"OpenBench 是否支持 TauBench 等代理（Agent）评估基准？","是的，团队已确认计划添加对 TauBench 的支持。虽然初始版本主要集中在 LLM -centric 的基准测试，但后续更新将包含针对 Agent 能力的评估。请关注项目更新或查看最新文档以确认该功能是否已发布。","https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F11",[145,150,155,160,165,170,175,180,185,190],{"id":146,"version":147,"summary_zh":148,"released_at":149},180249,"v0.5.3","## [0.5.3](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcompare\u002Fv0.5.2...v0.5.3) (2025-12-08)\n\n\n### 功能特性\n\n* 在 eval 命令中添加 --max-tasks 选项，用于并发执行任务 ([#279](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F279)) ([241e653](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F241e65392b04c747ab92b6ca7fcf3af825326869))\n* 添加 bbq 基准测试 ([#255](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F255)) ([46f4744](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F46f4744fa0e381cc50202532499efb83be1aba0c))\n* 添加 ChartQAPro ([#289](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F289)) ([677f7c7](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F677f7c7aa7f035f798e9135756cfadbebcac0368))\n* 添加可配置的 HuggingFace Hub 配置命名 ([#261](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F261)) ([8abe2ae](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F8abe2aefb3fe11ec4bb9c36deaf1ee4a21cf2656))\n* 添加 DocVQA 基准测试 ([#297](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F297)) ([0dd0edf](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F0dd0edf4fc606a6f967548083eae31149e72922f))\n* 为拼写错误的评估添加模糊匹配建议 ([#303](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F303)) 
([625a7b3](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F625a7b333c42ecd653a7d98ee15f154acc7e59d1))\n* 添加 ifbench 基准测试 ([#326](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F326)) ([bd730c2](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fbd730c25fd95fb602f72d45a06cd3744cc08da0a))\n* 添加数学 EvalGroup ([#263](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F263)) ([e0f4a9b](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fe0f4a9b9ab9ebdc327b601bc5d8c1ee52c2e878d))\n* 添加 MathVista 基准测试 ([#298](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F298)) ([5c50a8f](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F5c50a8fd68211257764aea2160dca61fd62c28b9))\n* 从 lighteval 引入 MMLU-Redux 基准测试 ([#321](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F321)) ([d22a587](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fd22a587d98fb27a3c00e2d557805749e8f3b2bcb))\n* 添加 MMVet V2 基准测试 ([#296](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F296)) ([66689de](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F66689de4622f477f8d88cb37c0d541174a91a81d))\n* 添加 OCRBench V2 基准测试 ([#295](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F295)) ([71f3589](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F71f3589f6802d1df587f3c288be608bef622c0fb))\n* 为 simpleqa 和 toxicity 添加可选扩展功能 ([#266](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F266)) ([2450ddf](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F2450ddf76ad546784fbf63ff25f5f48715d65516))\n* 添加 sealqa 基准测试 ([#283](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F283)) ([06b39e4](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F06b39e465cb73241e6cc4289c6417f1e347b7620))\n* 添加 
SMT 2024 基准测试 ([#239](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F239)) ([5d9b475](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F5d9b4752133aeb5562f0f22c793a391dcb70bde4))\n* 添加 tau bench 和 pass^k 指标 ([#294](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F294)) ([2bb1242](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F2bb12420c87776cd0570","2025-12-09T00:50:23",{"id":151,"version":152,"summary_zh":153,"released_at":154},180250,"v0.5.2","## [0.5.2](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcompare\u002Fv0.5.1...v0.5.2) (2025-10-16)\n\n\n### 杂项\n\n* 需要手动安装网络插件 ([#252](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F252)) ([090f801](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F090f80123841d398a9f55577654a232b9bad8628))","2025-10-16T05:22:55",{"id":156,"version":157,"summary_zh":158,"released_at":159},180251,"v0.5.1","## [0.5.1](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcompare\u002Fv0.5.0...v0.5.1) (2025-10-16)\n\n\n### 错误修复\n\n* 修复了损坏的链接 ([#247](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F247)) ([5ef66e0](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F5ef66e0908652b4becf11b6d3a8b5497656f9c40))\n\n\n### 杂项\n\n* 更新锁定文件 ([dff1bd9](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fdff1bd92e47bd9d4770da4a46f9ccfb3e90f83b2))\n\n\n### 重构\n\n* 将网络安全基准测试提取到插件中 ([#251](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F251)) ([df829e2](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fdf829e2482a481d4cde2f75d3cc906184eda330d))","2025-10-16T01:32:06",{"id":161,"version":162,"summary_zh":163,"released_at":164},180252,"v0.5.0","## [0.5.0](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcompare\u002Fv0.4.1...v0.5.0) (2025-10-10)\n\n\n### ⚠ 重大变更\n\n* 在基准测试目录下增加了更多分组 
([#244](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F244))\n\n### 功能\n\n* 添加了 clockbench 评估框架及用于合成公开数据集的脚本。([#159](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F159)) ([3ba9836](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F3ba98367252e4c0938d2841ff673662108cebe07))\n* 添加 IFEval ([#182](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F182)) ([8d1b939](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F8d1b939477a5a2202df4a874cb0fa0586fe2d9e3))\n* 在 inspect 中添加了 groq 提供商的本地 openbench 实现 ([#131](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F131)) ([52aea35](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F52aea3510f03cd49d0fb66c3e6789f61939a7ee0))\n* 添加 mmmlu 评估 ([#193](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F193)) ([a42c2d5](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fa42c2d5499a87366cfbb62cc254d47021565a628))\n* 添加 mmstar 基准测试 ([#174](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F174)) ([5d085ab](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F5d085ab4b14a0a176953bfb3134e89e8eb36cb85))\n* 添加新的 openbench 文档 ([#169](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F169)) ([f3e6a37](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Ff3e6a373c3195f90ed3017d81202ded81cde54d2))\n* 添加用于运行所有 18 个 BBH 任务的总控 bbh 命令 ([463a25f](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F463a25f308f8613f742f6bf79aba16d3998a30f3))\n* 添加预设评估组基础设施 ([#215](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F215)) ([d9ea03a](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fd9ea03a76f1a580037cd9760aa5ee3505104e341))\n* 在基准测试目录下增加了更多分组 ([#244](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F244)) 
([d932cb0](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fd932cb0b4aa8752a2075706f63485b3fb4dbdec0))\n* **ArabicMMLU:** 添加剩余的 32 个阿拉伯语考试子集，总计 41 个子集 ([#219](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F219)) ([006e248](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F006e2480fd336642f5bbe0fc3a08b521fbbe7dc9))\n* **benchmark:** 添加对 arc-agi 的支持 ([#158](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F158)) ([3f32253](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F3f3225324e7578b7e6e1cbf3261fa73107b75745))\n* **benchmark:** 添加对 detailbench 的支持 ([#154](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F154)) ([23fbca5](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F23fbca5b09b4636b9c07886ce20e10ed1d386eb8))\n* **benchmark:** 添加对 TUMLU 的支持 ([#160](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F160))  ([#161](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F161)) ([885be75](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F885be75b68529e1b81aa964f6fbd0e0ea1de0ccd))\n* **benchmark:** 多项挑战实现 ([#170](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F170)) ([cf2ab4f](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fcf2ab4fcf77195a66360408f9b230cceae3732e8))\n* 将默认模型更改为 groq\u002Fopenai\u002Fgpt-oss-20b ([#","2025-10-10T18:18:45",{"id":166,"version":167,"summary_zh":168,"released_at":169},180253,"v0.4.1","## [0.4.1](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcompare\u002Fv0.4.0...v0.4.1) (2025-08-29)\n\n\n### 错误修复\n\n* **rootly_gmcq：** 
在评分器中同时处理字符串和列表类型的内容（[#129](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F129)）（[376624d](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F376624d1c24ce876dc6bbd1a3c09b566e7b303b5)）","2025-08-29T01:11:01",{"id":171,"version":172,"summary_zh":173,"released_at":174},180254,"v0.4.0","## [0.4.0](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcompare\u002Fv0.3.0...v0.4.0) (2025-08-28)\n\n\n### 功能特性\n\n* 添加 boolq ([#70](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F70)) ([edbd1cc](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fedbd1cc1227e83a4de2d1c383acbc2c914063018))\n* 添加 BrowseComp ([#118](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F118)) ([498c706](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F498c7063f67f1ae5f2b68420269ad939e4a684ab))\n* 添加用于软件引用的 CITATION.cff ([#102](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F102)) ([16960de](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F16960dec7b75d55de8f60d78cee99a691a85d083))\n* 添加 CTI-Bench 网络安全基准测试套件 ([#96](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F96)) ([8465075](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F84650753e3b7b2cbe35b150d6b6466985d07e01d))\n* 添加 GitHub 问题和 PR 模板 ([#103](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F103)) ([68f0ef0](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F68f0ef0a514cabaa165ba343ca38d87edebe4452))\n* 添加 gmcq ([#114](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F114)) ([bb3c89d](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fbb3c89d03a6baf1dcc5e608a7438068e7f2f3d35))\n* 添加 MuSR 变体及分组指标 ([#107](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F107)) 
([10ae935](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F10ae935254531813b3dc087a7f127c08fad3422e))\n* 将 gpt-oss 的鲁棒答案提取评分器添加到 MathArena 基准测试和 gpqa_diamond 中 ([#97](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F97)) ([251ba66](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F251ba66b5e65cb30f1bd0afaaf1ac4a96e75a0ad))\n* 添加 Vercel AI Gateway 推理提供商 ([#98](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F98)) ([38e211a](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F38e211ab0cfa042b80d7bc62e02b529a816eb090))\n* jsonschemabench ([#95](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F95)) ([e3d842d](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fe3d842d10ee824baa882d8cb9e1f7c3e4adf28e2))\n* **mmmu:** 增加对 mmmu 基准测试及其所有子领域的支持 ([#121](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F121)) ([801bceb](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F801bcebe9e92b71022440a5c5788ac8b377a762e))\n\n\n### 错误修复\n\n* 格式化 mmlu_pro.py 数据集文件 ([2a9ee65](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F2a9ee651f680da10c3e5a2403d9103821e9e52bc))\n* 处理 CI 中跳过的集成测试 ([#120](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F120)) ([dae9378](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fdae937838b90ba39fb134daf694ea4bc3563508c))\n* **hle:** 为 hle 增加多模态支持 ([#128](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F128)) ([8c3f212](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F8c3f212b7cba8650a47f4bb297213265b4fac660))\n* **jsonschemaeval:** 对齐论文方法并添加 openai 子集 ([#113](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F113)) ([1b6470b](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F1b6470b3671f76eb809b8af55593ad9529179546))\n* 将 
claude-code-review 作业设为可选，以防止阻塞 PR ([#100](https:\u002F\u002Fgithub","2025-08-29T00:21:29",{"id":176,"version":177,"summary_zh":178,"released_at":179},180255,"v0.3.0","## [0.3.0](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcompare\u002Fv0.2.0...v0.3.0) (2025-08-14)\n\n\n### 功能特性\n\n* 为 eval-retry 命令添加 --debug 标志 ([b26afaa](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fb26afaad31986e184c2695c6384cb1736ac0dfcb))\n* 添加用于模型和任务参数的 -M 和 -T 标志 ([#75](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F75)) ([46a6ba6](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F46a6ba6b8a1d5a05b4ef1e53a9dcc1068967c4a8))\n* 添加 'openbench' 作为替代的 CLI 入口点 ([#48](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F48)) ([68b3c5b](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F68b3c5b4f8b8927dd5c6c8f68e25f831e9a5a222))\n* 添加 AI21 Labs 推理提供商 ([#86](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F86)) ([db7bde7](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fdb7bde7ea72eda2e688dd199d3e04e6505ccf1cc))\n* 添加 Baseten 推理提供商 ([#79](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F79)) ([696e2aa](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F696e2aa760faf94db116405ebccb819e2ce6a2b5))\n* 添加 Cerebras 和 SambaNova 模型提供商 ([1c61f59](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F1c61f597ddc801caf3f085fa29fd35c50fed7b37))\n* 添加 Cohere 推理提供商 ([#90](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F90)) ([8e6e838](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F8e6e838f447c7c0306c2c4f8523c7a9057046b0c))\n* 添加 Crusoe 推理提供商 ([#84](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F84)) 
([3d0c794](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F3d0c794dc5ef0d1eb188d3673e18f891850d0965))\n* 添加 DeepInfra 推理提供商 ([#85](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F85)) ([6fedf53](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F6fedf53fa585fcaf9ff9a0bf396eab9a7c6a7f49))\n* 添加 Friendli 推理提供商 ([#88](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F88)) ([7e2b258](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F7e2b25838e0c8725dbb8822099db826deabf2c8a))\n* 添加 Hugging Face 推理提供商 ([#54](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F54)) ([f479703](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Ff479703a08f6605f70592d01a82588486650d49c))\n* 添加 Hyperbolic 推理提供商 ([#80](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F80)) ([4ebf723](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F4ebf723c1577b542cef1c53f6bb254bc13c02a52))\n* 添加初始 GraphWalks 基准实现 ([#58](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F58)) ([1aefd07](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F1aefd07befb8eeaebefd97066518e9d1a0523d73))\n* 添加 Lambda AI 推理提供商 ([#81](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F81)) ([b78c346](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fb78c34690713c740af46d48eeedca967e15c64da))\n* 添加 MiniMax 推理提供商 ([#87](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F87)) ([09fd27b](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F09fd27b4dfe043325c908bbce1aa00430259f2ee))\n* 添加 Moonshot 推理提供商 ([#91](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F91)) ([e5743cb](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fe5743cbf4825c673d46ed98a157fee6e30961e6b))\n* 添加 Nebius 
模型提供商","2025-08-14T21:06:58",{"id":181,"version":182,"summary_zh":183,"released_at":184},180256,"v0.2.0","## [0.2.0](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcompare\u002Fv0.1.1...v0.2.0) (2025-08-11)\n\n\n### 功能\n\n* 添加 DROP（简单评估）([#20](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F20)) ([f85bf19](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Ff85bf194971f4a37b917d4d6ec6dfa31a1c3954c))\n* 添加 Humanity's Last Exam (HLE) 基准测试 ([#23](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F23)) ([6f10fb7](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F6f10fb71d6c8cabe8cddbb23bc0c979f8fb7234b))\n* 添加用于数学问题求解的 MATH 和 MATH-500 基准测试 ([#22](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F22)) ([9c6843b](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F9c6843babdfcbb85162cb88e71e3d2c71beeba5b))\n* 添加 MGSM ([#18](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F18)) ([bec1a7c](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fbec1a7c732912b235941e3cedfa1ff4f9092be0f))\n* 添加 openai MRCR 基准测试，用于长上下文回忆能力评估 ([#24](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F24)) ([1b09ebd](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F1b09ebd13e765652ec1b6e8756599a28d9544224))\n* HealthBench ([#16](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F16)) ([2caa47d](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F2caa47dad56faeaede219a41a0555d2887f782bc))\n\n\n### 文档\n\n* 更新 CLAUDE.md，加入 pre-commit 和依赖锁定要求 ([f33730e](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Ff33730e570d55a2da171f0e44a0382bef749421e))\n\n\n### 杂项\n\n* GitHub Terraform：创建\u002F更新 .github\u002Fworkflows\u002Fstale.yaml [跳过 CI] 
([1a00342](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F1a00342abde5d93dab3748157493a45dbf6a62b6))","2025-08-11T20:14:52",{"id":186,"version":187,"summary_zh":188,"released_at":189},180257,"v0.1.1","## [0.1.1](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcompare\u002Fv0.1.0...v0.1.1) (2025-07-31)\n\n\n### Bug 修复\n\n* 添加缺失的 `__init__.py` 文件，并修复 PyPI 的包发现问题 ([#10](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fissues\u002F10)) ([29fcdf6](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F29fcdf6fefa48fcf480db1f84cf5845f7f7758ce))\n\n\n### 文档\n\n* 更新 README，简化 OpenBench 的安装说明，并使用 PyPI 安装 ([16e08a0](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F16e08a091b6fcc56422df21d1352bcc88481f175))","2025-07-31T22:43:56",{"id":191,"version":192,"summary_zh":193,"released_at":194},180258,"v0.1.0","## 0.1.0（2025-07-31）\n\n\n### 功能\n\n* openbench ([3265bb0](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F3265bb07929f461a96d608d54fcdb144c66c0ac7))\n\n\n### 杂项\n\n* **ci:** 更新 release-please 工作流，以允许标签管理 ([b70db16](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002Fb70db1665355be278af8a6d06f2a58aeedbe4a31))\n* 删除用于发布的版本号标记 ([58ce995](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F58ce9958b715c2f83fab509afdf046811b18c128))\n* GitHub Terraform：创建\u002F更新 .github\u002Fworkflows\u002Fstale.yaml [跳过 ci] ([555658a](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F555658af369b4e88eb92bf7f2afa2adcc4934835))\n* 更新 0.1.0 版本的项目元数据，添加许可证、自述文件和仓库链接 ([9ea2102](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fopenbench\u002Fcommit\u002F9ea21029ebe3782d3d67b6aa075faf8862440fbf))","2025-07-31T09:53:05"]