[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-LiveBench--LiveBench":3,"tool-LiveBench--LiveBench":64},[4,17,27,35,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",143909,2,"2026-04-07T11:33:18",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 
都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85013,"2026-04-06T11:09:19",[26,43,44,45,14,46,15,13,47],"数据工具","视频","插件","其他","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":23,"last_commit_at":54,"category_tags":55,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 
协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[14,26,13,15,46],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":23,"last_commit_at":62,"category_tags":63,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",75054,"2026-04-07T10:38:03",[15,26,13,46],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":67,"owner_name":67,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":77,"owner_email":77,"owner_twitter":77,"owner_website":77,"owner_url":78,"languages":79,"stars":88,"forks":89,"last_commit_at":90,"license":91,"difficulty_score":23,"env_os":92,"env_gpu":93,"env_ram":94,"env_deps":95,"category_tags":102,"github_topics":77,"view_count":10,"oss_zip_url":77,"oss_zip_packed_at":77,"status":16,"created_at":103,"updated_at":104,"faqs":105,"releases":135},5220,"LiveBench\u002FLiveBench","LiveBench","LiveBench: A Challenging, Contamination-Free LLM Benchmark","LiveBench 是一个专为大语言模型（LLM）设计的动态评测基准，旨在提供更具挑战性且无数据污染的评估环境。它主要解决了传统基准测试中常见的“数据污染”难题——即模型因在训练阶段见过测试题而获得虚高分数的问题。通过每月发布基于最新数据集、arXiv 论文、新闻及电影简介的新题目，LiveBench 确保测试内容始终新鲜，真实反映模型的泛化能力。\n\n此外，LiveBench 的所有问题均设有可验证的客观标准答案，无需依赖另一个大模型作为裁判即可实现自动化、高精度的评分，涵盖了编码、推理等六大类共 18 项多样化任务。\n\n这款工具非常适合 AI 研究人员、开发者以及希望客观对比模型性能的技术团队使用。其独特的技术亮点在于持续的月度更新机制与完全客观的评分体系，配合对 Docker 容器化代码执行的支持，能够安全、准确地评估复杂的智能体编程任务。无论是想要验证新模型的性能，还是追踪前沿模型的演进趋势，LiveBench 都提供了一个透明、可靠且不断进化的测试平台。","# 
LiveBench\n\n![Crates.io](https:\u002F\u002Fimg.shields.io\u002Fcrates\u002Fl\u002FAp?color=orange)\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Flivebench.ai\u002F\">🏆 Leaderboard\u003C\u002Fa> •\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Flivebench\">💻 Data \u003C\u002Fa> •\n    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.19314\">📝 Paper\u003C\u002Fa> \n\u003C\u002Fp>\n\nLiveBench appeared as a [Spotlight Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=sKYHBTAxVa) in ICLR 2025.\n\nTop models as of 30th September 2024 (for a full up-to-date leaderboard, see [here](https:\u002F\u002Flivebench.ai\u002F)):\n\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiveBench_LiveBench_readme_40d134046f32.png)\n\nPlease see the [changelog](changelog.md) for details about each LiveBench release.\n\n## Table of Contents\n\n- [Introduction](#introduction)\n- [Installation Quickstart](#installation-quickstart)\n- [Usage](#usage)\n- [Data](#data)\n- [Adding New Questions](#adding-new-questions)\n- [Evaluating New Models and Configuring API Parameters](#evaluating-new-models-and-configuring-api-parameters)\n- [Documentation](#documentation)\n- [Citation](#citation)\n\n## Introduction\n\nIntroducing LiveBench: a benchmark for LLMs designed with test set contamination and objective evaluation in mind.\n\nLiveBench has the following properties:\n\n* LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses.\n* Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge.\n* LiveBench currently contains a set of 18 diverse tasks across 6 categories, and we will release new, harder tasks over time.\n\n**We will evaluate your model!** Open an 
[issue](https:\u002F\u002Fgithub.com\u002FLiveBench\u002FLiveBench\u002Fissues) or email us at [livebench@livebench.ai](mailto:livebench@livebench.ai)!\n\n## Installation Quickstart\n\nWe recommend using a virtual environment to install LiveBench.\n```bash\npython -m venv .venv\nsource .venv\u002Fbin\u002Factivate\n```\n\nTo generate answers with API models (i.e. with `gen_api_answer.py`), conduct judgments, and show results:\n\n```bash\ncd LiveBench\npip install -e .\n```\n\nTo score results on the coding tasks (code_completion and code_generation), you will also need to install the required dependencies:\n```bash\ncd livebench\u002Fcode_runner\npip install -r requirements_eval.txt\n```\n\nNote that, to evaluate the agentic coding questions, you will need docker installed and available (i.e. the command `docker --version` should work).\nThis will be checked prior to such tasks being run.\n\n**Note about local models**: Local model inference is unmaintained. We highly recommend serving your model on an OpenAI compatible API using [vllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) and performing inference using `run_livebench.py`.\n\nOur repo contains code from [LiveCodeBench](https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002FLiveCodeBench) and [IFEval](https:\u002F\u002Fgithub.com\u002FRohan2002\u002FIFEval?tab=readme-ov-file).\n\n## Usage\n\n```bash\ncd livebench\n```\n\n### Running Evaluations\n\nThe simplest way to run LiveBench inference and scoring is using the `run_livebench.py` script, which handles the entire evaluation pipeline including generating answers, scoring them, and showing results.\n\nBasic usage:\n```bash\npython run_livebench.py --model gpt-4o --bench-name live_bench\u002Fcoding --livebench-release-option 2024-11-25\n```\n\nSome common options:\n- `--bench-name`: Specify which subset(s) of questions to use (e.g. 
`live_bench` for all questions, `live_bench\u002Fcoding` for coding tasks only)\n- `--model`: The model to evaluate\n- `--max-tokens`: Maximum number of tokens in model responses (defaults to 4096 unless overridden for specific models)\n- `--api-base`: Custom API endpoint for OpenAI-compatible servers\n- `--api-key-name`: Environment variable name containing the API key (defaults to OPENAI_API_KEY for OpenAI models)\n- `--api-key`: Raw API key value\n- `--parallel-requests`: Number of concurrent API requests (for models with high rate limits)\n- `--resume`: Continue from a previous interrupted run\n- `--retry-failures`: Retry questions that failed in previous runs\n- `--livebench-release-option`: Evaluate questions from a specific LiveBench release\n\nRun `python run_livebench.py --help` to see all available options.\n\nWhen this is finished, follow along with [Viewing Results](#viewing-results) to view results.\n\n**Note: The current LiveBench release is 2025-04-25; however, not all questions for this release are public on Huggingface. In order to evaluate all categories, you will need to pass `--livebench-release-option 2024-11-25` to all scripts to use the most recent public questions.**\n\n**Note: Evaluation of the agentic coding tasks requires building task-specific Docker images. Storing all of these images may take up to 150GB. Images are needed both for inference and evaluation. In the future we will work on optimizing the evaluation process for this task to minimize storage requirements.**\n\n#### Parallel Evaluation Options\n\nLiveBench provides two different arguments for parallelizing evaluations, which can be used independently or together:\n\n- `--mode parallel`: Runs separate tasks\u002Fcategories in parallel by creating multiple tmux sessions. Each category or task runs in its own terminal session, allowing simultaneous evaluation across different benchmark subsets. This also parallelizes the ground truth evaluation phase. 
By default, this will create one session for each category; if `--bench-name` is supplied, there will be one session for each value of `--bench-name`.\n\n- `--parallel-requests`: Sets the number of concurrent questions to be answered within a single task evaluation instance. This controls how many API requests are made simultaneously for a specific task.\n\n**When to use which option:**\n\n- **For high rate limits (e.g., commercial APIs with high throughput):**\n  - Use both options together for maximum throughput when evaluating the full benchmark.\n  - For example: `python run_livebench.py --model gpt-4o --bench-name live_bench --mode parallel --parallel-requests 10`\n\n- **For lower rate limits:**\n  - When running the entire LiveBench suite, `--mode parallel` is recommended to parallelize across categories, even if `--parallel-requests` must be kept low.\n  - For small subsets of tasks, `--parallel-requests` may be more efficient as the overhead of creating multiple tmux sessions provides less benefit.\n  - Example for lower rate limits on full benchmark: `python run_livebench.py --model claude-3-5-sonnet --bench-name live_bench --mode parallel --parallel-requests 2`\n\n- **For single task evaluation:**\n  - When running just one or two tasks, use only `--parallel-requests`: `python run_livebench.py --model gpt-4o --bench-name live_bench\u002Fcoding --parallel-requests 10`\n\nNote that `--mode parallel` requires tmux to be installed on your system. The number of tmux sessions created will depend on the number of categories or tasks being evaluated.\n\n### Viewing Results\n\nYou can view the results of your evaluations using the `show_livebench_result.py` script:\n\n```bash\npython show_livebench_result.py --bench-name \u003Cbench-name> --model-list \u003Cmodel-list> --question-source \u003Cquestion-source> --livebench-release-option 2024-11-25\n```\n\n`\u003Cmodel-list>` is a space-separated list of model IDs to show. 
For example, to show the results of gpt-4o and claude-3-5-sonnet on coding tasks, run:\n```bash\npython show_livebench_result.py --bench-name live_bench\u002Fcoding --model-list gpt-4o claude-3-5-sonnet\n```\n\nMultiple `--bench-name` values can be provided to see scores on specific subsets of benchmarks:\n```bash\npython show_livebench_result.py --bench-name live_bench\u002Fcoding live_bench\u002Fmath --model-list gpt-4o\n```\n\nIf no `--model-list` argument is provided, all models will be shown. The `--question-source` argument defaults to `huggingface` but should match what was used during evaluation, as should `--livebench-release-option`.\n\nThe leaderboard will be displayed in the terminal. You can also find the breakdown by category in `all_groups.csv` and by task in `all_tasks.csv`.\n\n### Error Checking\n\nThe `scripts\u002Ferror_check.py` script will print out questions for which a model's output is `$ERROR$`, which indicates repeated API call failures.\nYou can use the `scripts\u002Frerun_failed_questions.py` script to rerun the failed questions, or run `run_livebench.py` as normal with the `--resume` and `--retry-failures` arguments.\n\nBy default, LiveBench will retry API calls three times and will include a delay in between attempts to account for rate limits. If the errors seen during evaluation are due to rate limits, you may need to switch to `--mode single` or `--mode sequential` and decrease the value of `--parallel-requests`. If after multiple attempts, the model's output is still `$ERROR$`, it's likely that the question is triggering some content filter from the model's provider (Gemini models are particularly prone to this, with an error of `RECITATION`). In this case, there is not much that can be done. 
We consider such failures to be incorrect responses.\n\n## Data\nThe questions for each of the categories can be found below:\n- [Reasoning](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Freasoning)\n- [Math](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Fmath)\n- [Coding](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Fcoding)\n- [Language](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Flanguage)\n- [Data Analysis](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Fdata_analysis)\n- [Instruction Following](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Finstruction_following)\n\nAlso available are the [model answers](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Fmodel_answer) and the [model judgments](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Fmodel_judgment).\n\nTo download the `question.jsonl` files (for inspection) and answer\u002Fjudgment files from the leaderboard, use\n```bash\npython download_questions.py\npython download_leaderboard.py\n```\n\nQuestions will be downloaded to `livebench\u002Fdata\u002F\u003Ccategory>\u002Fquestion.jsonl`.\n\n## Adding New Questions\nIf you want to create your own set of questions, or try out different prompts, etc, follow these steps:\n\n- Create a `question.jsonl` file with the following path (or, run `python download_questions.py` and update the downloaded file): `livebench\u002Fdata\u002Flive_bench\u002F\u003Ccategory>\u002F\u003Ctask>\u002Fquestion.jsonl`. For example, `livebench\u002Fdata\u002Flive_bench\u002Freasoning\u002Fweb_of_lies_new_prompt\u002Fquestion.jsonl`. 
Here is an example of the format for `question.jsonl` (it's the first few questions from [web_of_lies_v2](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Freasoning)):\n\n```jsonl\n{\"question_id\": \"0daa7ca38beec4441b9d5c04d0b98912322926f0a3ac28a5097889d4ed83506f\", \"category\": \"reasoning\", \"ground_truth\": \"no, yes, yes\", \"turns\": [\"In this question, assume each person either always tells the truth or always lies. Tala is at the movie theater. The person at the restaurant says the person at the aquarium lies. Ayaan is at the aquarium. Ryan is at the botanical garden. The person at the park says the person at the art gallery lies. The person at the museum tells the truth. Zara is at the museum. Jake is at the art gallery. The person at the art gallery says the person at the theater lies. Beatriz is at the park. The person at the movie theater says the person at the train station lies. Nadia is at the campground. The person at the campground says the person at the art gallery tells the truth. The person at the theater lies. The person at the amusement park says the person at the aquarium tells the truth. Grace is at the restaurant. The person at the aquarium thinks their friend is lying. Nia is at the theater. Kehinde is at the train station. The person at the theater thinks their friend is lying. The person at the botanical garden says the person at the train station tells the truth. The person at the aquarium says the person at the campground tells the truth. The person at the aquarium saw a firetruck. The person at the train station says the person at the amusement park lies. Mateo is at the amusement park. Does the person at the train station tell the truth? Does the person at the amusement park tell the truth? Does the person at the aquarium tell the truth? Think step by step, and then put your answer in **bold** as a list of three words, yes or no (for example, **yes, no, yes**). 
If you don't know, guess.\"], \"task\": \"web_of_lies_v2\"}\n```\n\n- If adding a new task, create a new scoring method in the `process_results` folder. If it is similar to an existing task, you can copy that task's scoring function. For example, `livebench\u002Fprocess_results\u002Freasoning\u002Fweb_of_lies_new_prompt\u002Futils.py` can be a copy of the `web_of_lies_v2` scoring method.\n- Add the scoring function to `gen_ground_truth_judgment.py` [here](https:\u002F\u002Fgithub.com\u002FLiveBench\u002FLiveBench\u002Fblob\u002F93e3a7d4fa5bb164ef4cb58f67683e4e54554af9\u002Flivebench\u002Fgen_ground_truth_judgment.py#L124).\n\n- Run and score models using `--question-source jsonl` and specifying your task. For example:\n```bash\npython gen_api_answer.py --bench-name live_bench\u002Freasoning\u002Fweb_of_lies_new_prompt --model claude-3-5-sonnet --question-source jsonl\npython gen_ground_truth_judgment.py --bench-name live_bench\u002Freasoning\u002Fweb_of_lies_new_prompt --question-source jsonl\npython show_livebench_result.py --bench-name live_bench\u002Freasoning\u002Fweb_of_lies_new_prompt\n```\n\n## Evaluating New Models and Configuring API Parameters\n\nAny API-based model for which there is an OpenAI-compatible endpoint should work out of the box using the `--api-base` and `--api-key` (or `--api-key-name`) arguments. If you'd like to override the name of the model for local files (e.g. saving it as `deepseek-v3` instead of `deepseek-chat`), use the `--model-display-name` argument. You can also override values for temperature and max tokens using the `--force-temperature` and `--max-tokens` arguments, respectively.\n\nIf you'd like to have persistent model configuration without needing to specify command-line arguments, you can create a model configuration document in a YAML file in `livebench\u002Fmodel\u002Fmodel_configs`. See the other files there for examples of the necessary format. 
Important values are `model_display_name`, which determines the answer .jsonl file name and model ID used for other scripts, and `api_name`, which provides a mapping between API providers and names for the model in that API. For instance, Deepseek R1 can be evaluated using the Deepseek API with a name of `deepseek-reasoner` and the Together API with a name of `deepseek-ai\u002Fdeepseek-r1`. `api_kwargs` allows you to set overrides for parameters such as temperature, max tokens, and top p, for all providers or for specific ones. Once this is set, you can use `--model \u003Cmodel_name>` with the `model_display_name` value you put in the yaml document when running `run_livebench.py`.\n\nWhen performing inference, use the `--model-provider-override` argument to override the provider you'd like to use for the model.\n\nWe have also implemented inference for Anthropic, Cohere, Mistral, Together, and Google models, so those should also all work immediately either by using `--model-provider-override` or adding a new entry to the appropriate configuration file.\n\nIf you'd like to use a model with a new provider that is not OpenAI-compatible, you will need to implement a new completions function in `completions.py` and add it to `get_api_function` in that file; then, you can use it in your model configuration.\n\n## Documentation\nHere, we describe our dataset documentation. 
This information is also available in our paper.\n- [Author Responsibility](docs\u002FAUTHOR_RESPONSIBILITY.md)\n- [Code of Conduct](docs\u002FCODE_OF_CONDUCT.md)\n- [Contributing](docs\u002FCONTRIBUTING.md)\n- [Datasheet for LiveBench](docs\u002FDATASHEET.md)\n- [Maintenance Plan](docs\u002FMAINTENANCE_PLAN.md)\n\n## Citation\n\n```bibtex\n@inproceedings{livebench,\n  title={LiveBench: A Challenging, Contamination-Free {LLM} Benchmark},\n  author={Colin White and Samuel Dooley and Manley Roberts and Arka Pal and Benjamin Feuer and Siddhartha Jain and Ravid Shwartz-Ziv and Neel Jain and Khalid Saifullah and Sreemanti Dey and Shubh-Agrawal and Sandeep Singh Sandha and Siddartha Venkat Naidu and Chinmay Hegde and Yann LeCun and Tom Goldstein and Willie Neiswanger and Micah Goldblum},\n  booktitle={The Thirteenth International Conference on Learning Representations},\n  year={2025},\n}\n```\n","# LiveBench\n\n![Crates.io](https:\u002F\u002Fimg.shields.io\u002Fcrates\u002Fl\u002FAp?color=orange)\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Flivebench.ai\u002F\">🏆 排行榜\u003C\u002Fa> •\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Flivebench\">💻 数据 \u003C\u002Fa> •\n    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.19314\">📝 论文\u003C\u002Fa> \n\u003C\u002Fp>\n\nLiveBench 作为 [Spotlight Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=sKYHBTAxVa) 出现在 ICLR 2025 上。\n\n截至 2024 年 9 月 30 日的顶级模型（完整且最新的排行榜请见 [这里](https:\u002F\u002Flivebench.ai\u002F)）：\n\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiveBench_LiveBench_readme_40d134046f32.png)\n\n有关每次 LiveBench 发布的详细信息，请参阅 [changelog](changelog.md)。\n\n## 目录\n\n- [简介](#introduction)\n- [安装快速入门](#installation-quickstart)\n- [使用方法](#usage)\n- [数据](#data)\n- [添加新问题](#adding-new-questions)\n- [评估新模型及配置 API 参数](#evaluating-new-models-and-configuring-api-parameters)\n- [文档](#documentation)\n- [引用](#citation)\n\n## 简介\n\n隆重推出 LiveBench：一个专为 LLM 
设计的基准测试，旨在解决测试集污染问题，并实现客观评估。\n\nLiveBench 具有以下特点：\n\n* LiveBench 每月发布新问题，以减少潜在的测试集污染；同时，问题基于近期发布的数据集、arXiv 论文、新闻文章以及 IMDb 电影梗概。\n* 每道题都配有可验证的客观标准答案，因此即使难度较高的题目也能被准确、自动地评分，无需借助 LLM 作为评判者。\n* LiveBench 目前包含 6 个类别下的 18 项多样化任务，未来我们将陆续推出更多难度更高的任务。\n\n**我们愿意为您评估模型！** 请在 [GitHub](https:\u002F\u002Fgithub.com\u002FLiveBench\u002FLiveBench\u002Fissues) 上提交一个问题，或发送邮件至 [livebench@livebench.ai](mailto:livebench@livebench.ai)！\n\n## 安装快速入门\n\n建议使用虚拟环境来安装 LiveBench。\n```bash\npython -m venv .venv\nsource .venv\u002Fbin\u002Factivate\n```\n\n若要通过 API 模型生成答案（即使用 `gen_api_answer.py`）、进行判断并展示结果：\n```bash\ncd LiveBench\npip install -e .\n```\n\n若需对编码任务（代码补全和代码生成）的结果进行评分，还需安装必要的依赖项：\n```bash\ncd livebench\u002Fcode_runner\npip install -r requirements_eval.txt\n```\n\n请注意，要评估代理式编码问题，您需要安装并可用 Docker（即运行 `docker --version` 命令应能成功）。在执行此类任务之前，系统会检查是否已安装 Docker。\n\n**关于本地模型的说明**：本地模型推理功能已不再维护。我们强烈建议您使用 [vllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) 将您的模型部署为兼容 OpenAI 的 API，并通过 `run_livebench.py` 进行推理。\n\n我们的仓库包含了来自 [LiveCodeBench](https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002FLiveCodeBench) 和 [IFEval](https:\u002F\u002Fgithub.com\u002FRohan2002\u002FIFEval?tab=readme-ov-file) 的代码。\n\n## 使用方法\n\n```bash\ncd livebench\n```\n\n### 运行评估\n\n运行 LiveBench 推理与评分最简单的方式是使用 `run_livebench.py` 脚本，该脚本会处理整个评估流程，包括生成答案、评分并展示结果。\n\n基本用法：\n```bash\npython run_livebench.py --model gpt-4o --bench-name live_bench\u002Fcoding --livebench-release-option 2024-11-25\n```\n\n一些常用选项：\n- `--bench-name`：指定要使用的子集问题（例如 `live_bench` 表示所有问题，`live_bench\u002Fcoding` 表示仅编码任务）\n- `--model`：要评估的模型\n- `--max-tokens`：模型响应的最大 token 数量（默认为 4096，除非针对特定模型覆盖）\n- `--api-base`：用于兼容 OpenAI 服务器的自定义 API 端点\n- `--api-key-name`：包含 API 密钥的环境变量名称（默认为 OpenAI 模型的 OPENAI_API_KEY）\n- `--api-key`：API 密钥的原始值\n- `--parallel-requests`：并发 API 请求数量（适用于具有高速率限制的模型）\n- `--resume`：从上次中断的地方继续运行\n- `--retry-failures`：重试之前运行失败的问题\n- `--livebench-release-option`：评估特定 LiveBench 版本中的问题\n\n运行 `python run_livebench.py --help` 
可查看所有可用选项。\n\n完成上述操作后，请按照 [查看结果](#viewing-results) 部分查看评估结果。\n\n**注意：当前 LiveBench 的最新版本是 2025-04-25；然而，该版本的所有问题尚未在 Huggingface 上公开。为了评估所有类别，您需要在所有脚本中添加 `--livebench-release-option 2024-11-25` 参数，以使用最近公开的问题。**\n\n**注意：代理式编码任务的评估需要构建特定于任务的 Docker 镜像。存储所有这些镜像可能占用高达 150GB 的空间。这些镜像既用于推理也用于评估。未来我们将致力于优化此任务的评估流程，以尽量减少存储需求。**\n\n#### 并行评估选项\n\nLiveBench 提供了两个不同的参数来并行化评估，这两个参数可以单独使用，也可以结合使用：\n\n- `--mode parallel`：通过创建多个 tmux 会话，并行运行不同的任务或类别。每个类别或任务都在独立的终端会话中运行，从而实现跨不同基准子集的同时评估。此选项也会并行化真值评估阶段。默认情况下，它会为每个类别创建一个会话；如果提供了 `--bench-name` 参数，则会为 `--bench-name` 的每个值创建一个会话。\n\n- `--parallel-requests`：设置在一个任务评估实例中同时回答的问题数量。这控制着针对特定任务同时发出的 API 请求数量。\n\n**何时使用哪个选项：**\n\n- **对于高速率限制（例如吞吐量高的商业 API）：**\n  - 在评估整个基准时，建议同时使用这两个选项以获得最大吞吐量。\n  - 例如：`python run_livebench.py --model gpt-4o --bench-name live_bench --mode parallel --parallel-requests 10`\n\n- **对于较低的速率限制：**\n  - 在运行整个 LiveBench 套件时，建议使用 `--mode parallel` 来实现跨类别的并行化，即使 `--parallel-requests` 必须保持较低。\n  - 对于小型任务子集，`--parallel-requests` 可能更高效，因为创建多个 tmux 会话所带来的额外开销较小。\n  - 低速率限制下运行完整基准的示例：`python run_livebench.py --model claude-3-5-sonnet --bench-name live_bench --mode parallel --parallel-requests 2`\n\n- **对于单个任务的评估：**\n  - 当只运行一两个任务时，只需使用 `--parallel-requests`：`python run_livebench.py --model gpt-4o --bench-name live_bench\u002Fcoding --parallel-requests 10`\n\n请注意，`--mode parallel` 需要您的系统上已安装 tmux。创建的 tmux 会话数量将取决于正在评估的类别或任务数量。\n\n### 查看结果\n\n您可以使用 `show_livebench_result.py` 脚本查看评估结果：\n\n```bash\npython show_livebench_result.py --bench-name \u003Cbench-name> --model-list \u003Cmodel-list> --question-source \u003Cquestion-source> --livebench-release-option 2024-11-25\n```\n\n`\u003Cmodel-list>` 是一个以空格分隔的模型 ID 列表，用于指定要显示的模型。例如，要查看 gpt-4o 和 claude-3-5-sonnet 在编码任务上的结果，请运行：\n```bash\npython show_livebench_result.py --bench-name live_bench\u002Fcoding --model-list gpt-4o claude-3-5-sonnet\n```\n\n您还可以提供多个 `--bench-name` 参数，以查看特定基准子集上的得分：\n```bash\npython show_livebench_result.py --bench-name live_bench\u002Fcoding live_bench\u002Fmath 
--model-list gpt-4o\n```\n\n如果未提供 `--model-list` 参数，则会显示所有模型的结果。`--question-source` 参数默认为 `huggingface`，但应与评估时使用的来源一致，同样地，`--livebench-release-option` 也应匹配。\n\n排行榜将显示在终端中。您还可以在 `all_groups.csv` 文件中找到按类别划分的详细结果，在 `all_tasks.csv` 文件中找到按任务划分的详细结果。\n\n### 错误检查\n\n`scripts\u002Ferror_check.py` 脚本会打印出模型输出为 `$ERROR$` 的问题，这表示 API 调用多次失败。您可以使用 `scripts\u002Frerun_failed_questions.py` 脚本重新运行这些失败的问题，或者正常运行 `run_livebench.py` 并添加 `--resume` 和 `--retry-failures` 参数。\n\n默认情况下，LiveBench 会重试 API 调用三次，并在每次尝试之间加入延迟以应对速率限制。如果评估过程中出现的错误是由于速率限制导致的，您可能需要切换到 `--mode single` 或 `--mode sequential` 模式，并减少 `--parallel-requests` 的值。如果经过多次尝试后，模型的输出仍然是 `$ERROR$`，则很可能是该问题触发了模型提供商的内容过滤机制（Gemini 模型尤其容易出现这种情况，错误信息为 `RECITATION`）。在这种情况下，通常无法进一步处理。我们视此类失败为不正确的回答。\n\n## 数据\n\n各个类别的问题可以在以下链接中找到：\n- [推理](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Freasoning)\n- [数学](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Fmath)\n- [编码](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Fcoding)\n- [语言](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Flanguage)\n- [数据分析](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Fdata_analysis)\n- [指令遵循](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Finstruction_following)\n\n此外，还提供了 [模型答案](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Fmodel_answer) 和 [模型判断](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Fmodel_judgment)。\n\n要下载排行榜中的 `question.jsonl` 文件（用于检查）以及答案和判断文件，可以使用以下命令：\n```bash\npython download_questions.py\npython download_leaderboard.py\n```\n\n问题将被下载到 `livebench\u002Fdata\u002F\u003Ccategory>\u002Fquestion.jsonl` 目录下。\n\n## 评估新问题\n\n如果您想创建自己的问题集，或尝试不同的提示等，可以按照以下步骤操作：\n\n- 创建一个 `question.jsonl` 文件，路径如下（或者运行 `python download_questions.py` 
并更新下载的文件）：`livebench\u002Fdata\u002Flive_bench\u002F\u003Ccategory>\u002F\u003Ctask>\u002Fquestion.jsonl`。例如，`livebench\u002Fdata\u002Freasoning\u002Fweb_of_lies_new_prompt\u002Fquestion.jsonl`。以下是 `question.jsonl` 文件的格式示例（来自 [web_of_lies_v2](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivebench\u002Freasoning) 的前几道题）：\n\n```jsonl\n{\"question_id\": \"0daa7ca38beec4441b9d5c04d0b98912322926f0a3ac28a5097889d4ed83506f\", \"category\": \"reasoning\", \"ground_truth\": \"no, yes, yes\", \"turns\": [\"在这个问题中，假设每个人要么总是说真话，要么总是说谎话。塔拉在电影院。餐厅里的人说水族馆里的人在说谎。阿扬在水族馆。瑞安在植物园。公园里的人说艺术画廊里的人在说谎。博物馆里的人说的是真话。扎拉在博物馆。杰克在艺术画廊。艺术画廊里的人说剧院里的人在说谎。贝亚特丽斯在公园。电影院里的人说火车站的人在说谎。纳迪娅在露营地。露营地里的人说艺术画廊的人说的是真话。剧院里的人在说谎。游乐场里的人说水族馆的人说的是真话。格蕾丝在餐厅。水族馆里的人认为他的朋友在说谎。尼亚在剧院。凯欣德在火车站。剧院里的人认为他的朋友在说谎。植物园里的人说火车站的人说的是真话。水族馆里的人说露营地的人说的是真话。水族馆里的人看到了一辆消防车。火车站里的人说游乐场的人在说谎。马特奥在游乐场。那么，火车站的人说的是真话吗？游乐场的人说的是真话吗？水族馆的人说的是真话吗？请一步一步思考，然后用粗体字将你的答案写成一个由三个词组成的列表，分别是“是”或“否”（例如：“是，否，是”）。如果你不知道答案，就猜一猜。\"], \"task\": \"web_of_lies_v2\"}\n```\n\n- 如果要添加新任务，需在 `process_results` 文件夹中创建新的评分方法。如果新任务与现有任务类似，可以直接复制现有任务的评分函数。例如，`livebench\u002Fprocess_results\u002Freasoning\u002Fweb_of_lies_new_prompt\u002Futils.py` 可以直接复制 `web_of_lies_v2` 的评分方法。\n- 将评分函数添加到 `gen_ground_truth_judgment.py` 中的相应位置：[此处](https:\u002F\u002Fgithub.com\u002FLiveBench\u002FLiveBench\u002Fblob\u002F93e3a7d4fa5bb164ef4cb58f67683e4e54554af9\u002Flivebench\u002Fgen_ground_truth_judgment.py#L124)。\n\n- 使用 `--question-source jsonl` 并指定您的任务来运行并评估模型。例如：\n```bash\npython gen_api_answer.py --bench-name live_bench\u002Freasoning\u002Fweb_of_lies_new_prompt --model claude-3-5-sonnet --question-source jsonl\npython gen_ground_truth_judgment.py --bench-name live_bench\u002Freasoning\u002Fweb_of_lies_new_prompt --question-source jsonl\npython show_livebench_result.py --bench-name live_bench\u002Freasoning\u002Fweb_of_lies_new_prompt\n```\n\n## 评估新模型及配置 API 参数dee\n\n任何具备 OpenAI 兼容端点的基于 API 的模型，只需使用 `--api-base` 和 `--api-key`（或 `--api-key-name`）参数即可开箱即用。若希望为本地文件覆盖模型名称（例如将保存名为 
`deepseek-v3` 而非 `deepseek-chat`），可使用 `--model-display-name` 参数。此外，您还可以分别通过 `--force-temperature` 和 `--max-tokens` 参数来覆盖温度和最大标记数的值。\n\n若希望实现持久化的模型配置而无需每次指定命令行参数，可在 `livebench\u002Fmodel\u002Fmodel_configs` 目录下创建一个 YAML 格式的模型配置文件。请参考该目录下的其他文件以了解所需格式。其中重要的字段包括 `model_display_name`，它决定了答案 .jsonl 文件名以及供其他脚本使用的模型 ID；还有 `api_name`，用于建立 API 提供商与该 API 中模型名称之间的映射关系。例如，Deepseek R1 可以通过 Deepseek API 以 `deepseek-reasoner` 作为名称进行评估，也可以通过 Together API 以 `deepseek-ai\u002Fdeepseek-r1` 作为名称进行评估。`api_kwargs` 允许您为所有提供商或特定提供商设置温度、最大标记数和 top p 等参数的覆盖值。完成上述配置后，在运行 `run_livebench.py` 时，即可使用 `--model \u003Cmodel_name>`，其中 `\u003Cmodel_name>` 应与 YAML 文件中填写的 `model_display_name` 值一致。\n\n在执行推理时，可使用 `--model-provider-override` 参数来覆盖您希望为该模型使用的提供商。\n\n我们还实现了对 Anthropic、Cohere、Mistral、Together 和 Google 模型的推理支持，因此这些模型同样可以通过使用 `--model-provider-override` 或向相应配置文件添加新条目而立即投入使用。\n\n若要使用与 OpenAI 不兼容的新提供商的模型，则需要在 `completions.py` 中实现一个新的补全函数，并将其添加到该文件中的 `get_api_function` 函数中；随后，您便可在模型配置中使用该函数。\n\n## 文档\n在此，我们介绍我们的数据集文档。相关信息亦可在我们的论文中找到。\n- [作者责任](docs\u002FAUTHOR_RESPONSIBILITY.md)\n- [行为准则](docs\u002FCODE_OF_CONDUCT.md)\n- [贡献指南](docs\u002FCONTRIBUTING.md)\n- [LiveBench 数据集说明](docs\u002FDATASHEET.md)\n- [维护计划](docs\u002FMAINTENANCE_PLAN.md)\n\n## 引用\n\n```bibtex\n@inproceedings{livebench,\n  title={LiveBench: 一个具有挑战性且无污染的 {LLM} 基准测试},\n  author={Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum},\n  booktitle={第十三届国际学习表征会议},\n  year={2025},\n}\n```","# LiveBench 快速上手指南\n\nLiveBench 是一个专为大语言模型（LLM）设计的基准测试工具，旨在通过每月发布新问题来减少测试集污染，并提供可验证的客观真实答案，无需依赖 LLM 裁判即可自动评分。\n\n## 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **操作系统**: Linux 或 macOS (Windows 用户建议使用 WSL2)\n*   **Python**: 版本 3.8 或更高\n*   **Docker**: 如果您计划评估**代理编码任务 (Agentic Coding)**，必须安装 Docker 并确保 `docker --version` 
命令可用。此类任务需要构建特定的 Docker 镜像（可能占用高达 150GB 空间）。\n*   **tmux**: 如果使用并行评估模式 (`--mode parallel`)，需安装 tmux。\n\n> **注意**：本地模型推理支持已不再维护。强烈建议使用 [vllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) 将模型部署为 OpenAI 兼容的 API 服务，然后通过 `run_livebench.py` 进行评测。\n\n## 安装步骤\n\n推荐使用虚拟环境进行安装，以避免依赖冲突。\n\n1.  **创建并激活虚拟环境**：\n    ```bash\n    python -m venv .venv\n    source .venv\u002Fbin\u002Factivate\n    ```\n\n2.  **安装 LiveBench 核心包**：\n    进入项目目录并安装可编辑版本：\n    ```bash\n    cd LiveBench\n    pip install -e .\n    ```\n    *(国内用户如遇下载慢，可添加 `-i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple` 使用清华源)*\n\n3.  **安装代码任务依赖（可选）**：\n    如果您需要评分代码完成 (`code_completion`) 和代码生成 (`code_generation`) 任务，需额外安装相关依赖：\n    ```bash\n    cd livebench\u002Fcode_runner\n    pip install -r requirements_eval.txt\n    ```\n\n## 基本使用\n\nLiveBench 提供了统一的脚本 `run_livebench.py` 来处理从生成答案、评分到展示结果的完整流程。\n\n### 运行评估\n\n以下是最简单的运行示例，用于评估 `gpt-4o` 模型在编码任务上的表现：\n\n```bash\ncd livebench\npython run_livebench.py --model gpt-4o --bench-name live_bench\u002Fcoding --livebench-release-option 2024-11-25\n```\n\n**关键参数说明：**\n*   `--model`: 要评估的模型名称（如 `gpt-4o`, `claude-3-5-sonnet`）。\n*   `--bench-name`: 指定测试子集。例如 `live_bench` (全部), `live_bench\u002Fcoding` (仅编码), `live_bench\u002Fmath` (仅数学)。\n*   `--livebench-release-option`: **重要**。当前最新公开的问题版本为 `2024-11-25`。若不指定此参数，可能无法获取所有类别的最新公开题目。\n*   `--api-key`: 如果您的 API Key 未设置在环境变量中，可直接在此处传入。默认读取 `OPENAI_API_KEY`。\n\n### 查看结果\n\n评估完成后，使用以下命令查看终端排行榜：\n\n```bash\npython show_livebench_result.py --bench-name live_bench\u002Fcoding --model-list gpt-4o --livebench-release-option 2024-11-25\n```\n\n*   结果详情也可在生成的 `all_groups.csv` (按类别) 和 `all_tasks.csv` (按任务) 文件中查看。\n\n### 进阶提示：并行加速\n\n为了加快评估速度，您可以根据 API 速率限制调整并行策略：\n\n*   **高吞吐量场景**（如商业 API）：同时开启任务并行和请求并行。\n    ```bash\n    python run_livebench.py --model gpt-4o --bench-name live_bench --mode parallel --parallel-requests 10 --livebench-release-option 2024-11-25\n    ```\n*   **低速率限制场景**：仅开启任务并行，降低并发请求数。\n    ```bash\n    
python run_livebench.py --model claude-3-5-sonnet --bench-name live_bench --mode parallel --parallel-requests 2 --livebench-release-option 2024-11-25\n    ```","某 AI 初创团队在发布新大模型前，急需向投资人证明其模型具备真实的推理能力，而非仅仅“背题”的高手。\n\n### 没有 LiveBench 时\n- **评估结果虚高**：模型在静态榜单上得分很高，但因训练数据污染（见过考题），实际处理新问题时表现糟糕，导致误判模型真实水平。\n- **评分主观且昂贵**：缺乏标准答案的开放性问题依赖另一个大模型（LLM Judge）来打分，不仅成本高，还引入了评分偏差和不一致性。\n- **迭代反馈滞后**：基准测试集更新缓慢，无法及时反映模型对最新新闻、论文或代码库的理解能力，难以指导快速迭代。\n- **自动化程度低**：需要人工编写脚本处理不同任务的评分逻辑，尤其是代码生成任务，环境配置复杂且容易出错。\n\n### 使用 LiveBench 后\n- **杜绝数据污染**：利用每月更新的基于最新论文、新闻和电影简介的题目，确保模型无法通过“死记硬背”作弊，真实暴露其泛化能力。\n- **客观自动评分**：所有问题均设有可验证的客观标准答案，无需引入裁判模型即可实现精准、低成本的自动化打分。\n- **紧跟前沿动态**：测试内容实时涵盖最新发布的数据集和技术文章，能立即检验模型对前沿知识的掌握情况，为优化提供即时反馈。\n- **一站式评估流程**：通过 `run_livebench.py` 脚本一键完成从答案生成到代码沙箱运行评分的全流程，大幅降低评估门槛和维护成本。\n\nLiveBench 通过持续更新的防污染题库和客观自动评分机制，让大模型的能力评估从“猜题游戏”回归到真实的智力较量。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiveBench_LiveBench_40d13404.png","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FLiveBench_1e09abde.png","",null,"https:\u002F\u002Fgithub.com\u002FLiveBench",[80,84],{"name":81,"color":82,"percentage":83},"Python","#3572A5",99.8,{"name":85,"color":86,"percentage":87},"Shell","#89e051",0.2,1124,100,"2026-04-06T20:11:34","NOASSERTION","Linux, macOS","未说明 (主要依赖 API 调用；若使用 vllm 部署本地模型，需参考 vllm 的 GPU 要求)","未说明 (评估代理编码任务时，存储 Docker 镜像可能需高达 150GB 磁盘空间)",{"notes":96,"python":97,"dependencies":98},"1. 本地模型推理代码已不再维护，强烈建议使用 vllm 将模型部署为 OpenAI 兼容 API 后进行评估。\n2. 评估代理编码任务（agentic coding）必须安装 Docker，且构建特定任务的镜像可能占用高达 150GB 存储空间。\n3. 使用并行评估模式（--mode parallel）需要系统安装 tmux。\n4. 默认通过 API 进行推理，无需本地 GPU，但需配置相应的 API Key。","未说明 (建议使用虚拟环境)",[99,100,101],"vllm (用于本地模型服务)","docker (用于代理编码任务评估)","tmux (用于并行评估模式)",[15,46],"2026-03-27T02:49:30.150509","2026-04-08T04:02:57.676555",[106,111,116,121,126,131],{"id":107,"question_zh":108,"answer_zh":109,"source_url":110},23668,"评估推理模型（如 QwQ、DeepSeek-R1、Qwen3）时，应该使用什么样的采样参数（Temperature, TopP 等）？","对于具有思考模式（thinking mode）的推理模型，默认的温度 0 可能导致模型陷入循环或表现不佳。建议配置如下：\n1. 
**思考模式**：Temperature=0.6, TopP=0.95, TopK=20, MinP=0。切勿使用贪婪解码（greedy decoding），否则会导致性能下降和无限重复。\n2. **非思考模式**：Temperature=0.7, TopP=0.8, TopK=20, MinP=0。\n注意：OpenAI 和 Anthropic 通常不允许手动设置推理模型的温度，但在本地部署（如 vLLM）时需要显式指定这些参数以获得最佳结果。","https:\u002F\u002Fgithub.com\u002FLiveBench\u002FLiveBench\u002Fissues\u002F156",{"id":112,"question_zh":113,"answer_zh":114,"source_url":115},23669,"如何提交我的模型到 LiveBench 排行榜？","目前基准测试代码针对每个任务会重新加载模型，这在大规模评估时效率较低。虽然官方计划未来优化此逻辑（修改 `load_questions` 和 `gen_api_answer.py` 以支持一次性加载），但短期内用户若需自行提交，可能需要手动调整代码逻辑：\n1. 修改 `livebench\u002Fcommon.py` 中的 `load_questions` 函数，使其返回所有问题而非按任务过滤。\n2. 修改 `livebench\u002Fgen_api_answer.py` 中的相关逻辑，移除按类别和任务循环的代码，确保模型只加载一次。\n如有具体实施问题，建议在 Issue 中进一步询问维护者。","https:\u002F\u002Fgithub.com\u002FLiveBench\u002FLiveBench\u002Fissues\u002F6",{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},23670,"运行 gen_ground_truth_judgment.py 时出现 FileNotFoundError 错误怎么办？","该错误通常是因为脚本试图读取不存在的输出文件（例如 `data\u002Flive_bench\u002Fcoding\u002Fcoding\u002Fmodel_judgment\u002Fground_truth_judgment.jsonl`）。这往往意味着之前的生成步骤未完成或路径配置有误。\n解决方法：\n1. 确认是否已先生成了模型回答文件（model_answer）。\n2. 检查目录结构是否与日志中显示的任务类别一致（如 coding, math, reasoning 等）。\n3. 尝试重新运行评估流程，确保中间文件正确生成。如果问题依旧，可以参考其他用户的复现结果（如 OpenHermes 的评估数据）来验证环境配置是否正确。","https:\u002F\u002Fgithub.com\u002FLiveBench\u002FLiveBench\u002Fissues\u002F5",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},23671,"如何在 Google Colab 中运行 LiveBench 时解决 'ModuleNotFoundError: No module named livebench' 错误？","在 Google Colab 中运行时，首先需要确保已正确安装 livebench 包。由于官方目前没有提供 Dockerfile，且 Colab 环境较为特殊，遇到模块找不到错误通常是因为未在当前会话中安装库。\n解决步骤：\n1. 在 Colab 单元格中运行安装命令：`!pip install git+https:\u002F\u002Fgithub.com\u002FLiveBench\u002FLiveBench.git`（或根据实际仓库地址安装）。\n2. 如果遇到 'Service Unavailable' 等网络错误，这通常是 Google 服务端的问题，建议稍后重试。\n3. 
确保导入语句前已执行安装操作，因为 Colab 重启后会丢失已安装的包。","https:\u002F\u002Fgithub.com\u002FLiveBench\u002FLiveBench\u002Fissues\u002F152",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},23672,"LiveBench 是否会更新最新的大模型版本（如 o1, Qwen3, Qwen2.5 系列）？","是的，LiveBench 会根据社区请求和模型发布情况持续更新排行榜。\n- 对于 **Qwen 系列**（包括 Qwen2.5, Qwen3, QwQ），社区强烈建议添加所有尺寸模型（特别是 7B, 14B, 32B 及 Math\u002FCoder 专用模型），维护者通常会采纳这些建议。\n- 对于 **o1\u002Fo1-mini** 的更新版本，维护者会评估其性能变化。如果新版本（如 o1-low）与现有版本在排名上没有显著差异，可能不会单独列入主榜，但可能会在特定过滤视图或讨论中提及。\n用户可以通过提交 Issue 请求添加特定模型，并提供相关的技术报告链接或 HuggingFace 地址。","https:\u002F\u002Fgithub.com\u002FLiveBench\u002FLiveBench\u002Fissues\u002F211",{"id":132,"question_zh":133,"answer_zh":134,"source_url":110},23673,"为什么我本地复现的模型分数与排行榜上的分数不一致？","分数不一致的主要原因通常是**采样参数配置不同**或**上下文长度设置差异**。\n1. **采样参数**：许多推理模型对 Temperature 非常敏感。如果使用默认的 Temperature=0，模型可能会陷入死循环或输出质量极低，导致分数远低于预期。请确保使用了模型推荐的参数（如 Temperature=0.6 或 0.7, TopP=0.95 等）。\n2. **上下文长度**：确保本地部署（如使用 vLLM）时设置了足够的上下文窗口（例如 32k 或更高），以匹配基准测试的要求。\n3. **验证方法**：可以先尝试复现排行榜上其他已知模型（如 Gemma 3, Llama 3.3）的分数，如果这些能对上，说明环境基本无误，问题出在目标模型的特定配置上。",[]]