[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-LiveCodeBench--LiveCodeBench":3,"tool-LiveCodeBench--LiveCodeBench":65},[4,17,27,35,48,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150037,2,"2026-04-10T23:33:47",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 
模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[26,43,44,45,14,46,15,13,47],"数据工具","视频","插件","其他","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":54,"last_commit_at":55,"category_tags":56,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 
社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,43,46],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":54,"last_commit_at":63,"category_tags":64,"status":16},6590,"gpt4all","nomic-ai\u002Fgpt4all","GPT4All 是一款让普通电脑也能轻松运行大型语言模型（LLM）的开源工具。它的核心目标是打破算力壁垒，让用户无需依赖昂贵的显卡（GPU）或云端 API，即可在普通的笔记本电脑和台式机上私密、离线地部署和使用大模型。\n\n对于担心数据隐私、希望完全掌控本地数据的企业用户、研究人员以及技术爱好者来说，GPT4All 提供了理想的解决方案。它解决了传统大模型必须联网调用或需要高端硬件才能运行的痛点，让日常设备也能成为强大的 AI 助手。无论是希望构建本地知识库的开发者，还是单纯想体验私有化 AI 聊天的普通用户，都能从中受益。\n\n技术上，GPT4All 基于高效的 `llama.cpp` 后端，支持多种主流模型架构（包括最新的 DeepSeek R1 蒸馏模型），并采用 GGUF 格式优化推理速度。它不仅提供界面友好的桌面客户端，支持 Windows、macOS 和 Linux 等多平台一键安装，还为开发者提供了便捷的 Python 库，可轻松集成到 LangChain 等生态中。通过简单的下载和配置，用户即可立即开始探索本地大模型的无限可能。",77307,"2026-04-11T06:52:37",[15,13],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":68,"owner_name":68,"owner_avatar_url":76,"owner_bio":77,"owner_company":77,"owner_location":77,"owner_email":77,"owner_twitter":77,"owner_website":77,"owner_url":78,"languages":79,"stars":84,"forks":85,"last_commit_at":86,"license":87,"difficulty_score":23,"env_os":88,"env_gpu":89,"env_ram":88,"env_deps":90,"category_tags":96,"github_topics":97,"view_count":10,"oss_zip_url":77,"oss_zip_packed_at":77,"status":16,"created_at":104,"updated_at":105,"faqs":106,"releases":137},5819,"LiveCodeBench\u002FLiveCodeBench","LiveCodeBench","Official repository for the paper \"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code\"","LiveCodeBench 是一个专为评估大语言模型编程能力而设计的开源基准测试平台。它致力于解决传统代码评测中常见的“数据污染”问题，即模型因在训练阶段见过测试题而导致分数虚高。为此，LiveCodeBench 持续从 LeetCode、AtCoder 和 CodeForces 三大主流竞赛平台收集最新发布的题目，确保评测内容的时效性与公正性。\n\n除了基础的代码生成任务，LiveCodeBench 还将评估维度扩展至代码自我修复、代码执行以及测试输出预测等更广泛的能力场景，提供全方位的性能洞察。其数据集保持动态更新，目前已收录上千道高质量编程难题，并划分了多个版本以支持不同时间跨度的研究需求。\n\n这一工具特别适合 AI 
研究人员、大模型开发者以及需要客观对比模型代码实力的技术团队使用。通过内置的排行榜和数据探索器，用户可以轻松追踪模型在真实新题上的表现。LiveCodeBench 凭借其对“防污染”机制的坚持和对多维代码能力的覆盖，为社区提供了一个可靠、透明且不断进化的评估标准，帮助开发者更准确地判断模型的实际编码水平。","# LiveCodeBench\nOfficial repository for the paper \"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code\"\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Flivecodebench.github.io\u002F\">🏠 Home Page\u003C\u002Fa> •\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Flivecodebench\u002F\">💻 Data \u003C\u002Fa> •\n    \u003Ca href=\"https:\u002F\u002Flivecodebench.github.io\u002Fleaderboard.html\">🏆 Leaderboard\u003C\u002Fa> •\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flivecodebench\u002Fcode_generation_samples\">🔍 Explorer\u003C\u002Fa> \n\u003C\u002Fp>\n\n## Introduction\nLiveCodeBench provides holistic and contamination-free evaluation of coding capabilities of LLMs.  Particularly, LiveCodeBench continuously collects new problems over time from contests across three competition platforms -- LeetCode, AtCoder, and CodeForces. Next, LiveCodeBench also focuses on a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. 
Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and March 2024.\n\n\n## Installation\nYou can clone the repository using the following command:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002FLiveCodeBench.git\ncd LiveCodeBench\n```\n\nWe recommend using [uv](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv)\nfor managing dependencies, which can be installed a [number of ways](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv?tab=readme-ov-file#installation).\n\nVerify that `uv` is installed on your system by running:\n\n```bash\nuv --version\n```\n\nOnce `uv` has been installed, use it to create a virtual environment for\nLiveCodeBench and install its dependencies with the following commands:\n\n```bash\nuv venv --python 3.11\nsource .venv\u002Fbin\u002Factivate\n\nuv pip install -e .\n```\n\n## Data\nWe provide a benchmark for different code capability scenarios\n- [Code Generation](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivecodebench\u002Fcode_generation_lite)\n- [Code Execution](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivecodebench\u002Fexecution)\n- [Test Output Prediction](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivecodebench\u002Ftest_generation)\n\n## Inference and Evaluation\n\n### Dataset Versions\nSince LiveCodeBench is a continuously updated benchmark, we provide different versions of the dataset. 
Particularly, we provide the following versions of the dataset:\n- `release_v1`: The initial release of the dataset with problems released between May 2023 and Mar 2024 containing 400 problems.\n- `release_v2`: The updated release of the dataset with problems released between May 2023 and May 2024 containing 511 problems.\n- `release_v3`: The updated release of the dataset with problems released between May 2023 and Jul 2024 containing 612 problems.\n- `release_v4`: The updated release of the dataset with problems released between May 2023 and Sep 2024 containing 713 problems.\n- `release_v5`: The updated release of the dataset with problems released between May 2023 and Jan 2025 containing 880 problems.\n- `release_v6`: The updated release of the dataset with problems released between May 2023 and Apr 2025 containing 1055 problems.\n\nYou can use the `--release_version` flag to specify the dataset version you wish to use. Particularly, you can use the following command to run the evaluation on the `release_v2` dataset. Release version defaults to `release_latest`. Additionally, we have introduced fine-grained release versions such as `v1`, `v2`, `v1_v3`, `v4_v5` for specific versions of the dataset.\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version release_v2\n```\n\n### Code Generation\n\nWe use `vllm` for inference using open models. By default, we use  `tensor_parallel_size=${num_gpus}` to parallelize inference across all available GPUs. It can be configured using the  `--tensor_parallel_size` flag as required. 
\n\nFor running the inference, please provide the `model_name` based on the [.\u002Flcb_runner\u002Flm_styles.py](.\u002Flcb_runner\u002Flm_styles.py) file.\nThe scenario (here `codegeneration`) can be used to specify the scenario for the model.\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario codegeneration\n```\n\nAdditionally, `--use_cache` flag can be used to cache the generated outputs and `--continue_existing` flag can be used to use the existing dumped results. In case you wish to use model from a local path, you can additionally provide `--local_model_path` flag with the path to the model. We use `n=10` and `temperature=0.2` for generation. Please check the [.\u002Flcb_runner\u002Frunner\u002Fparser.py](.\u002Flcb_runner\u002Frunner\u002Fparser.py) file for more details on the flags.\n\nFor closed API models,  `--multiprocess` flag can be used to parallelize queries to API servers (adjustable according to rate limits).\n\n\n#### Evaluation\nWe compute `pass@1` and `pass@5` metrics for model evaluations.\nWe use a modified version of the checker released with the [`apps` benchmark](https:\u002F\u002Fgithub.com\u002Fhendrycks\u002Fapps\u002Fblob\u002Fmain\u002Feval\u002Ftesting_util.py) to compute the metrics. Particularly, we identified some unhandled edge cases in the original checker and fixed them and additionally simplified the checker based on our collected dataset. To run the evaluation, you can add the `--evaluate` flag:\n\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate\n```\n\nNote that time limits can cause slight (`\u003C 0.5`) points of variation in the computation of the `pass@1` and `pass@5` metrics.\nIf you observe a significant variation in performance, adjust the `--num_process_evaluate` flag to a lower value or increase the `--timeout` flag. Please report particular issues caused by improper timeouts here. 
\n\nFinally, to get scores over different time windows, you can use [.\u002Flcb_runner\u002Fevaluation\u002Fcompute_scores.py](.\u002Flcb_runner\u002Fevaluation\u002Fcompute_scores.py) file. \nParticularly, you can provide `--start_date` and `--end_date` flags (using the `YYYY-MM-DD` format) to get scores over the specified time window. In our paper, to counter contamination in the DeepSeek models, we only report results on problems released after August 2023. You can replicate those evaluations using:\n\n```bash\npython -m lcb_runner.evaluation.compute_scores --eval_all_file {saved_eval_all_file} --start_date 2023-09-01\n```\n\n**NOTE: We have pruned a large number of test cases from the original benchmark and created `code_generation_lite` which is set as the default benchmark offering similar performance estimation much faster. If you wish to use the original benchmark, please use the `--not_fast` flag. We are in the process of updating the leaderboard scores with this updated setting.** \n\n**NOTE: V2 Update: to run the updated LiveCodeBench please use `--release_version release_v2`. In addition, if you have existing results from `release_v1` you can add `--continue_existing` or better `--continue_existing_with_eval` flags to reuse the old completions or evaluations respectively.**\n\n\n### Self Repair\nFor running self repair, you need to provide an additional `--codegen_n` flag that maps to the number of codes that were generated during code generation. Additionally, the `--temperature` flag is used to resolve the old code generation eval file which must be present in the `output` directory. \n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario selfrepair --codegen_n {num_codes_codegen} --n 1 # only n=1 supported\n```\n\nIn case you have results on a smaller subset or version of the benchmark, you can use `--continue_existing` and `--continue_existing_with_eval` flags to reuse the old computations. 
Particularly, you can run the following command to continue from existing generated solutions.\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario selfrepair --evaluate --continue_existing\n```\n\nNote that this will only reuse the generated samples and rerun evaluations. To reuse the old evaluations, you can add the `--continue_existing_with_eval` flag.\n\n### Test Output Prediction\nFor running the test output prediction scenario you can simply run\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario testoutputprediction --evaluate\n```\n\n### Code Execution\nFor running the code execution scenario you can simply run\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario codeexecution --evaluate\n```\n\nAdditionally, we support the chain-of-thought (CoT) setting with\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario codeexecution --cot_code_execution --evaluate\n```\n\n## Custom Evaluation\nAlternatively, you can use [`lcb_runner\u002Frunner\u002Fcustom_evaluator.py`](.\u002Flcb_runner\u002Frunner\u002Fcustom_evaluator.py) to directly evaluate model generations in a custom file. The file should contain a list of model outputs, appropriately formatted for evaluation in the order of benchmark problems. \n\n```bash\npython -m lcb_runner.runner.custom_evaluator --custom_output_file {path_to_custom_outputs}\n```\n\nParticularly, arrange the outputs in the following format\n\n```json\n[\n    {\"question_id\": \"id1\", \"code_list\": [\"code1\", \"code2\"]},\n    {\"question_id\": \"id2\", \"code_list\": [\"code1\", \"code2\"]}\n]\n```\n\n\n## Adding Support for New Models\n\nTo add support for new models, we have implemented an extensible framework to add new models and customize prompts appropriately. \n\nStep 1: Add a new model to the [.\u002Flcb_runner\u002Flm_styles.py](.\u002Flcb_runner\u002Flm_styles.py) file. 
Particularly, extend the `LMStyle` class to add a new model family and add the model to the `LanguageModelList` array.\n\nStep 2: Since we use instruction-tuned models, we allow configuring the instruction for each model. Modify the [.\u002Flcb_runner\u002Fprompts\u002Fgeneration.py](.\u002Flcb_runner\u002Fprompts\u002Fgeneration.py) file to add a new prompt for the model in the `format_prompt_generation` function. \nFor example, the prompt for the `DeepSeekCodeInstruct` family of models looks as follows\n\n```python\n# .\u002Flcb_runner\u002Fprompts\u002Fgeneration.py\nif LanguageModelStyle == LMStyle.DeepSeekCodeInstruct:\n    prompt = f\"{PromptConstants.SYSTEM_MESSAGE_DEEPSEEK}\\n\\n\"\n    prompt += f\"{get_deepseekcode_question_template_answer(question)}\"\n    return prompt\n```\n\n## Submit Models to Leaderboard\nWe are currently accepting submissions only for the code generation scenario. To submit models you can create a pull request on our [submissions](https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002Fsubmissions) repository. Particularly, you can copy your model generations folder from `output` to the `submissions` folder and create a pull request. We will review the submission and add the model to the leaderboard accordingly. \n\n## ERRATA\nWe maintain a list of known issues and updates in the [ERRATA.md](.\u002FERRATA.md) file. Particularly, we document issues regarding erroneous tests and problems not amenable to autograding. We are constantly using this feedback to improve our problem selection heuristics as we update LiveCodeBench.\n\n## Results\nLiveCodeBench can be used to evaluate performance of LLMs on different time-windows (using problem release date to filter the problems). 
\nThus we can detect and prevent potential contamination in the evaluation process and evaluate LLMs on _new_ problems.\n\n\u003Cdiv style=\"text-align: center;\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiveCodeBench_LiveCodeBench_readme_d177f1288bc0.png\" alt=\"Code Generation Live Evaluation\" class=\"teaser-image\"\n    width=\"40%\" \u002F>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiveCodeBench_LiveCodeBench_readme_3920bea1df97.png\" alt=\"Test Output Prediction Live Evaluation\" class=\"teaser-image\"\n    width=\"40%\" \u002F>\n\u003C\u002Fdiv>\n\nNext, we evaluate models on different code capabilities and find that relative performances of models do change over tasks (left). \nThus, it highlights the need for holistic evaluation of LLMs for code.\n\n\u003Cdiv style=\"text-align: center;\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiveCodeBench_LiveCodeBench_readme_ad4d930e1b8e.png\" alt=\"Holistic Tasks Evaluation\" class=\"teaser-image\"\n    width=\"36.1%\" \u002F>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiveCodeBench_LiveCodeBench_readme_5e1f8649a446.png\" alt=\"Comparing LCB vs HumanEval\" class=\"teaser-image\"\n    width=\"46%\" \u002F>\n\u003C\u002Fdiv>\n\nWe also find evidence of possible overfitting on HumanEval (right). \nParticularly, models that perform well on HumanEval do not necessarily perform well on LiveCodeBench. \nIn the scatterplot above, we find the models get clustered into two groups, shaded in red and green. 
\nThe red group contains models that perform well on HumanEval but poorly on LiveCodeBench, while the green group contains models that perform well on both.\n\nFor more details, please refer to our website at [livecodebench.github.io](https:\u002F\u002Flivecodebench.github.io).\n\n## Citation\n\n```bibtex\n@article{jain2024livecodebench,\n  author    = {Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica},\n  title     = {LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code},\n  year      = {2024},\n  journal   = {arXiv preprint},\n}\n```\n","# LiveCodeBench\n论文《LiveCodeBench：面向代码的大语言模型的整体且无污染评估》的官方仓库\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Flivecodebench.github.io\u002F\">🏠 首页\u003C\u002Fa> •\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Flivecodebench\u002F\">💻 数据\u003C\u002Fa> •\n    \u003Ca href=\"https:\u002F\u002Flivecodebench.github.io\u002Fleaderboard.html\">🏆 排行榜\u003C\u002Fa> •\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flivecodebench\u002Fcode_generation_samples\">🔍 浏览器\u003C\u002Fa> \n\u003C\u002Fp>\n\n## 简介\nLiveCodeBench 提供了对大语言模型编码能力的整体且无污染评估。具体而言，LiveCodeBench 会持续从三个竞赛平台——LeetCode、AtCoder 和 CodeForces——的比赛中收集新的题目。此外，LiveCodeBench 不仅关注代码生成，还涵盖了更广泛的代码相关能力，如自我修复、代码执行和测试输出预测等。目前，LiveCodeBench 拥有四百道高质量的编程题目，这些题目发布于 2023 年 5 月至 2024 年 3 月期间。\n\n\n## 安装\n您可以通过以下命令克隆仓库：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002FLiveCodeBench.git\ncd LiveCodeBench\n```\n\n我们推荐使用 [uv](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) 来管理依赖项，它可以通过多种方式安装[参见此处](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv?tab=readme-ov-file#installation)。\n\n请通过运行以下命令验证 `uv` 是否已安装在您的系统上：\n\n```bash\nuv --version\n```\n\n一旦 `uv` 安装完毕，您可以使用它为 LiveCodeBench 创建虚拟环境，并通过以下命令安装其依赖项：\n\n```bash\nuv venv --python 3.11\nsource .venv\u002Fbin\u002Factivate\n\nuv pip install -e .\n```\n\n## 
数据\n我们提供了针对不同代码能力场景的基准数据集：\n- [代码生成](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivecodebench\u002Fcode_generation_lite)\n- [代码执行](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivecodebench\u002Fexecution)\n- [测试输出预测](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flivecodebench\u002Ftest_generation)\n\n## 推理与评估\n\n### 数据集版本\n由于 LiveCodeBench 是一个持续更新的基准，我们提供了多个数据集版本。具体来说，我们提供以下版本的数据集：\n- `release_v1`：初始发布的数据集，包含 2023 年 5 月至 2024 年 3 月期间发布的 400 道题目。\n- `release_v2`：更新后的数据集，包含 2023 年 5 月至 2024 年 5 月期间发布的 511 道题目。\n- `release_v3`：进一步更新的数据集，包含 2023 年 5 月至 2024 年 7 月期间发布的 612 道题目。\n- `release_v4`：再更新的数据集，包含 2023 年 5 月至 2024 年 9 月期间发布的 713 道题目。\n- `release_v5`：最新更新的数据集，包含 2023 年 5 月至 2025 年 1 月期间发布的 880 道题目。\n- `release_v6`：最终更新的数据集，包含 2023 年 5 月至 2025 年 4 月期间发布的 1055 道题目。\n\n您可以使用 `--release_version` 标志来指定要使用的数据集版本。例如，以下命令将使用 `release_v2` 数据集进行评估。默认情况下，`release_version` 设置为 `release_latest`。此外，我们还引入了细粒度的版本标识，如 `v1`、`v2`、`v1_v3`、`v4_v5` 等，用于特定版本的数据集。\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version release_v2\n```\n\n### 代码生成\n\n对于开源模型的推理，我们使用 `vllm` 工具。默认情况下，我们会根据可用的 GPU 数量设置 `tensor_parallel_size=${num_gpus}`，以实现多 GPU 并行推理。您也可以根据需要使用 `--tensor_parallel_size` 标志进行自定义配置。\n\n在运行推理时，请根据 [.\u002Flcb_runner\u002Flm_styles.py](.\u002Flcb_runner\u002Flm_styles.py) 文件中的内容提供 `model_name`。同时，您可以使用 `scenario` 参数（例如 `codegeneration`）来指定模型的应用场景。\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario codegeneration\n```\n\n此外，您可以使用 `--use_cache` 标志来缓存生成结果，或使用 `--continue_existing` 标志来复用已保存的推理结果。如果您希望使用本地路径下的模型，还可以添加 `--local_model_path` 标志并指定模型路径。我们在生成过程中通常设置 `n=10` 和 `temperature=0.2`。更多关于各参数的详细信息，请参阅 [.\u002Flcb_runner\u002Frunner\u002Fparser.py](.\u002Flcb_runner\u002Frunner\u002Fparser.py) 文件。\n\n对于闭源 API 模型，可以使用 `--multiprocess` 标志来并行化对 API 服务器的请求（可根据速率限制进行调整）。\n\n\n#### 评估\n我们计算 `pass@1` 和 `pass@5` 这两个指标来评估模型性能。为了计算这些指标，我们基于 [`apps` 
基准](https:\u002F\u002Fgithub.com\u002Fhendrycks\u002Fapps\u002Fblob\u002Fmain\u002Feval\u002Ftesting_util.py)中发布的检查器进行了修改。具体来说，我们发现原始检查器存在一些未处理的边界情况，并对其进行了修复；同时，我们也根据收集到的数据集简化了检查器。要运行评估，只需添加 `--evaluate` 标志即可：\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate\n```\n\n需要注意的是，时间限制可能会导致 `pass@1` 和 `pass@5` 指标的计算结果出现轻微波动（小于 0.5 分）。如果发现性能差异较大，可以尝试降低 `--num_process_evaluate` 的值，或增加 `--timeout` 的值。对于因超时设置不当而导致的问题，请在此处反馈。\n\n最后，若需获取不同时间段内的得分，可以使用 [.\u002Flcb_runner\u002Fevaluation\u002Fcompute_scores.py](.\u002Flcb_runner\u002Fevaluation\u002Fcompute_scores.py) 文件。您可以通过提供 `--start_date` 和 `--end_date` 标志（格式为 `YYYY-MM-DD`），来计算指定时间段内的得分。在我们的论文中，为了消除 DeepSeek 模型可能存在的数据污染问题，我们仅报告了 2023 年 8 月之后发布的题目的评估结果。您可以使用以下命令来复现这些评估结果：\n\n```bash\npython -m lcb_runner.evaluation.compute_scores --eval_all_file {saved_eval_all_file} --start_date 2023-09-01\n```\n\n**注意：我们已从原始基准中删减了大量测试用例，创建了 `code_generation_lite` 版本，该版本作为默认基准，能够在更短时间内提供相似的性能评估。如果您希望使用原始基准，请使用 `--not_fast` 标志。我们目前正在更新排行榜上的分数，以反映这一调整。**\n\n**注意：V2 更新：要运行 LiveCodeBench 的 V2 版本，请使用 `--release_version release_v2`。此外，如果您已有 `release_v1` 的结果，可以添加 `--continue_existing` 或更好的是 `--continue_existing_with_eval` 标志，分别用于复用之前的推理结果或评估结果。**\n\n### 自我修复\n要运行自我修复，您需要提供一个额外的 `--codegen_n` 标志，该标志对应于代码生成过程中生成的代码数量。此外，`--temperature` 标志用于解析旧的代码生成评估文件，该文件必须位于 `output` 目录中。\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario selfrepair --codegen_n {num_codes_codegen} --n 1 # 仅支持 n=1\n```\n\n如果您已经在较小的基准测试子集或版本上获得了结果，可以使用 `--continue_existing` 和 `--continue_existing_with_eval` 标志来重用之前的计算结果。特别是，您可以运行以下命令从已有的生成解继续：\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario selfrepair --evaluate --continue_existing\n```\n\n请注意，这只会重用生成的样本并重新运行评估。若要重用之前的评估结果，可以添加 `--continue_existing_with_eval` 标志。\n\n### 测试输出预测\n要运行测试输出预测场景，只需执行以下命令：\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario testoutputprediction --evaluate\n```\n\n### 
代码执行\n要运行代码执行场景，只需执行以下命令：\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario codeexecution --evaluate\n```\n\n此外，我们还支持 COT 设置，具体如下：\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario codeexecution --cot_code_execution --evaluate\n```\n\n## 自定义评估\n或者，您可以使用 [`lcb_runner\u002Frunner\u002Fcustom_evaluator.py`](.\u002Flcb_runner\u002Frunner\u002Fcustom_evaluator.py) 直接在自定义文件中评估模型生成的结果。该文件应包含按基准问题顺序排列、格式正确的模型输出列表。\n\n```bash\npython -m lcb_runner.runner.custom_evaluator --custom_output_file {path_to_custom_outputs}\n```\n\n特别地，输出应按照以下格式排列：\n\n```json\n[\n    {\"question_id\": \"id1\", \"code_list\": [\"code1\", \"code2\"]},\n    {\"question_id\": \"id2\", \"code_list\": [\"code1\", \"code2\"]}\n]\n```\n\n## 添加对新模型的支持\n为了支持新模型，我们实现了一个可扩展的框架，以便添加新模型并相应地自定义提示。\n\n步骤 1：将新模型添加到 [.\u002Flcb_runner\u002Flm_styles.py](.\u002Flcb_runner\u002Flm_styles.py) 文件中。具体来说，扩展 `LMStyle` 类以添加新的模型系列，并将该模型添加到 `LanguageModelList` 数组中。\n\n步骤 2：由于我们使用的是指令微调模型，因此允许为每个模型配置指令。修改 [.\u002Flcb_runner\u002Fprompts\u002Fgeneration.py](.\u002Flcb_runner\u002Fprompts\u002Fgeneration.py) 文件，在 `format_prompt_generation` 函数中为该模型添加新的提示。\n例如，`DeepSeekCodeInstruct` 系列模型的提示如下：\n\n```python\n# .\u002Flcb_runner\u002Fprompts\u002Fgeneration.py\nif LanguageModelStyle == LMStyle.DeepSeekCodeInstruct:\n    prompt = f\"{PromptConstants.SYSTEM_MESSAGE_DEEPSEEK}\\n\\n\"\n    prompt += f\"{get_deepseekcode_question_template_answer(question)}\"\n    return prompt\n```\n\n## 提交模型至排行榜\n目前我们仅接受代码生成场景的提交。要提交模型，您可以在我们的 [submissions](https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002Fsubmissions) 仓库中创建一个拉取请求。具体而言，您可以将 `output` 目录中的模型生成文件夹复制到 `submissions` 文件夹中，并创建拉取请求。我们将审核您的提交，并相应地将模型添加到排行榜中。\n\n## 错误与更新\n我们在 [ERRATA.md](.\u002FERRATA.md) 文件中维护了一份已知问题和更新的列表。特别是，我们记录了有关错误测试以及无法自动评分的问题。我们会不断利用这些反馈来改进问题选择策略，从而持续更新 LiveCodeBench。\n\n## 结果\nLiveCodeBench 可用于评估大型语言模型在不同时间窗口上的性能（通过问题发布日期筛选模型）。这样我们可以检测并防止评估过程中的潜在污染，从而在 _新_ 问题上评估大型语言模型。\n\n\u003Cdiv style=\"text-align: center;\">\n  
  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiveCodeBench_LiveCodeBench_readme_d177f1288bc0.png\" alt=\"代码生成实时评估\" class=\"teaser-image\"\n    width=\"40%\" \u002F>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiveCodeBench_LiveCodeBench_readme_3920bea1df97.png\" alt=\"测试输出预测实时评估\" class=\"teaser-image\"\n    width=\"40%\" \u002F>\n\u003C\u002Fdiv>\n\n接下来，我们评估了模型在不同代码能力上的表现，发现模型在不同任务上的相对表现确实会发生变化（左图）。这凸显了对大型语言模型进行代码相关综合评估的必要性。\n\n\u003Cdiv style=\"text-align: center;\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiveCodeBench_LiveCodeBench_readme_ad4d930e1b8e.png\" alt=\"综合任务评估\" class=\"teaser-image\"\n    width=\"36.1%\" \u002F>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiveCodeBench_LiveCodeBench_readme_5e1f8649a446.png\" alt=\"比较 LCB 与 HumanEval\" class=\"teaser-image\"\n    width=\"46%\" \u002F>\n\u003C\u002Fdiv>\n\n我们还发现了模型可能在 HumanEval 上过拟合的证据（右图）。具体而言，那些在 HumanEval 上表现良好的模型并不一定能在 LiveCodeBench 上取得好成绩。在上面的散点图中，我们可以看到模型被分为两组，分别用红色和绿色标注。红色组的模型在 HumanEval 上表现良好，但在 LiveCodeBench 上表现较差；而绿色组的模型则在这两个基准上都表现出色。\n\n更多详情，请访问我们的网站：[livecodebench.github.io](https:\u002F\u002Flivecodebench.github.io)。\n\n## 引用\n```bibtex\n@article{jain2024livecodebench,\n  author    = {Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica},\n  title     = {LiveCodeBench: 大型语言模型代码相关综合且无污染评估},\n  year      = {2024},\n  journal   = {arXiv 预印本},\n}\n```","# LiveCodeBench 快速上手指南\n\nLiveCodeBench 是一个持续更新、防污染的大语言模型代码能力评估基准。它从 LeetCode、AtCoder 和 CodeForces 等平台收集最新题目，支持代码生成、自我修复、代码执行和测试输出预测等多种评估场景。\n\n## 环境准备\n\n*   **操作系统**: Linux 或 macOS (Windows 用户建议使用 WSL2)\n*   **Python 版本**: 推荐 **Python 3.11**\n*   **依赖管理工具**: 推荐使用 [`uv`](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) 进行极速依赖安装和环境管理。\n    *   *国内加速提示*：如果 `uv` 安装或下载依赖较慢，可在后续 pip 安装步骤中配置国内镜像源。\n\n## 安装步骤\n\n### 1. 
克隆仓库\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002FLiveCodeBench.git\ncd LiveCodeBench\n```\n\n### 2. 安装 uv (如未安装)\n请根据 [uv 官方文档](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv?tab=readme-ov-file#installation) 安装。验证安装：\n```bash\nuv --version\n```\n\n### 3. 创建虚拟环境并安装依赖\n使用 `uv` 创建 Python 3.11 环境并安装项目依赖。\n*(注：若需使用国内 PyPI 镜像加速，可在 install 命令后添加 `--index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`)*\n\n```bash\nuv venv --python 3.11\nsource .venv\u002Fbin\u002Factivate\n\nuv pip install -e .\n```\n\n## 基本使用\n\nLiveCodeBench 的核心功能是通过 `lcb_runner` 模块运行推理和评估。以下以最常见的**代码生成 (Code Generation)** 场景为例。\n\n### 1. 运行推理与评估\n将 `{model_name}` 替换为你想要测试的模型名称（需已在 `.\u002Flcb_runner\u002Flm_styles.py` 中定义，如 `deepseek-coder-33b-instruct` 等）。\n\n默认情况下，脚本会自动使用最新的数据集版本 (`release_latest`) 并进行评估。\n\n```bash\npython -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate\n```\n\n**常用参数说明：**\n*   `--release_version`: 指定数据集版本，例如 `release_v5` (截至 2025 年 1 月) 或 `release_v6` (截至 2025 年 4 月)。\n*   `--use_cache`: 缓存生成的输出，避免重复推理。\n*   `--local_model_path`: 如果模型在本地路径，使用此参数指定路径。\n*   `--tensor_parallel_size`: 指定使用的 GPU 数量进行并行推理（默认为所有可用 GPU）。\n\n**示例：使用特定版本数据集和本地模型进行评估**\n```bash\npython -m lcb_runner.runner.main --model my_local_model --scenario codegeneration --evaluate --release_version release_v6 --local_model_path \u002Fpath\u002Fto\u002Fmodel --tensor_parallel_size 2\n```\n\n### 2. 查看结果\n运行结束后，评估结果（包括 `pass@1` 和 `pass@5` 指标）通常会保存在 `output` 目录下的相应文件中。你可以使用提供的脚本按时间窗口计算分数，以检测数据污染情况：\n\n```bash\npython -m lcb_runner.evaluation.compute_scores --eval_all_file {saved_eval_all_file} --start_date 2023-09-01\n```\n\n### 3. 
其他评估场景\nLiveCodeBench 还支持其他代码能力评估，只需更改 `--scenario` 参数：\n\n*   **测试输出预测**:\n    ```bash\n    python -m lcb_runner.runner.main --model {model_name} --scenario testoutputprediction --evaluate\n    ```\n*   **代码执行**:\n    ```bash\n    python -m lcb_runner.runner.main --model {model_name} --scenario codeexecution --evaluate\n    ```\n*   **自我修复 (Self Repair)**:\n    需要先进行代码生成，然后指定生成的代码数量 `--codegen_n`。\n    ```bash\n    python -m lcb_runner.runner.main --model {model_name} --scenario selfrepair --codegen_n 10 --evaluate\n    ```\n\n> **注意**：为了加快评估速度，默认使用的是精简版数据集 (`code_generation_lite`)。如需使用包含完整测试用例的原始基准，请添加 `--not_fast` 标志。","某 AI 实验室团队正在研发新一代代码大模型，急需在发布前对其编程能力进行权威且公正的验收评估。\n\n### 没有 LiveCodeBench 时\n- **评估结果失真**：使用的静态基准测试题早已被收录进模型训练数据，导致评分虚高，无法反映模型真实的解题能力（即“数据污染”问题）。\n- **考核维度单一**：仅能测试代码生成能力，缺乏对代码执行、自我修复及测试输出预测等关键实战技能的量化手段。\n- **时效性滞后**：题库更新缓慢，无法覆盖 LeetCode、CodeForces 等平台最近几个月出现的新颖算法题，难以检验模型处理前沿问题的能力。\n- **选型决策困难**：由于缺乏统一、动态的排行榜，团队难以客观对比不同开源模型的真实水平，导致基座模型选型依赖主观经验。\n\n### 使用 LiveCodeBench 后\n- **杜绝数据污染**：LiveCodeBench 持续抓取竞赛平台最新题目（如 2025 年发布的题目），确保测试集从未出现在训练数据中，评估结果真实可信。\n- **全维度能力画像**：除了代码生成，还能通过内置数据集精准评估模型的代码执行效率、自动修复 bug 的能力以及预测测试输出的准确性。\n- **紧跟技术前沿**：利用其细粒度的版本控制（如 `release_v6`），团队可立即使用包含 1000+ 道最新难题的数据集，验证模型对新兴考点的掌握度。\n- **科学决策依据**：参考官方实时更新的 Leaderboard，团队能快速锁定在特定场景下表现最优的开源模型，大幅降低试错成本。\n\nLiveCodeBench 通过提供无污染、持续更新且多维度的评估体系，帮助开发者戳破“刷分”泡沫，精准定位模型的真实代码智力水平。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiveCodeBench_LiveCodeBench_d177f128.png","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FLiveCodeBench_765d2ce1.png",null,"https:\u002F\u002Fgithub.com\u002FLiveCodeBench",[80],{"name":81,"color":82,"percentage":83},"Python","#3572A5",100,836,185,"2026-04-08T04:18:22","MIT","未说明","运行代码生成推理时推荐使用 NVIDIA GPU（通过 vllm），支持多卡并行（tensor_parallel_size），具体显存和 CUDA 版本取决于所选模型大小，README 中未明确指定最低要求",{"notes":91,"python":92,"dependencies":93},"强烈建议使用 uv 工具管理依赖和创建虚拟环境。推理部分主要依赖 vllm 库进行加速，支持通过 tensor_parallel_size 参数配置多 GPU 并行。评估指标计算使用了修改版的 apps benchmark 
checker。默认使用精简版数据集（code_generation_lite）以加快评估速度，若需使用完整数据集需添加 --not_fast 标志。","3.11",[94,95],"vllm","uv",[15],[98,99,100,101,102,103],"code-execution","code-generation","code-llms","code-repair","gpt-4","test-generation","2026-03-27T02:49:30.150509","2026-04-11T18:29:17.399067",[107,112,117,122,127,132],{"id":108,"question_zh":109,"answer_zh":110,"source_url":111},26369,"使用 datasets.load_dataset 加载 livecodebench\u002Fcode_generation_lite 时出现错误，提示找不到 main_cls，如何解决？","这个问题通常是因为从 Hugging Face 下载的 code_generation_lite.py 文件显示成功但实际上内容为空。解决方法是单独下载 livecodebench\u002Fcode_generation_lite\u002Fcode_generation_lite.py 文件，并替换本地旧的空内容文件。","https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002FLiveCodeBench\u002Fissues\u002F73",{"id":113,"question_zh":114,"answer_zh":115,"source_url":116},26370,"为什么排行榜上看不到 deepseek-r1、kimi1.6 或 Qwen 等模型的结果？","这是因为基准测试最近更新了新问题，而开源模型在新问题上的评估尚未重新运行。之前的版本排行榜可以在 https:\u002F\u002Flivecodebench.github.io\u002Fleaderboard_v5.html 查看。","https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002FLiveCodeBench\u002Fissues\u002F95",{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},26371,"是否可以使用公开测试用例进行测试时扩展（test-time scaling）并提交结果到排行榜？","可以参考 GPT-O3 的策略：采样大量解决方案，在不泄露任何私有测试信息的前提下使用某种方法重新排序，最后提交认为最好的那个解决方案。排行榜上标记为\"High\"的模型（如 Kimi-k1.6-IOI-high, O1-2024-12-17 (High) 等）可能采用了这种多次采样后提交最佳结果的策略。","https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002FLiveCodeBench\u002Fissues\u002F86",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},26372,"加载 execution-v2 数据集时出现 ExpectedMoreSplitsError: {'train'} 错误，如何解决？","该问题已通过更新 README 文档修复。请查阅最新的 README 文档以获取正确的数据加载方式和配置说明。","https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002FLiveCodeBench\u002Fissues\u002F48",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},26373,"运行评估脚本时发现的问题数量与预期不符（例如显示 400 题而非 450 或 511 题），原因是什么？","这是因为使用了已不再维护的 `--not_fast` 选项。请移除该选项，这样生成的测试集题目数量才会正确（例如 511 题）。此外，官方排行榜主要基于温度 T=0.2、采样数 N=10 
的评估结果，建议使用此设置进行提交和比较。","https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002FLiveCodeBench\u002Fissues\u002F40",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},26374,"用于生成论文中数据表（包含 numsteps, problem_id 等字段）的代码在哪里可以找到？","模型生成的代码样本可以在 https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flivecodebench\u002Fcode_generation_samples\u002Ftree\u002Fmain 找到。但是，项目方明确表示不计划发布用于生成特定评估数据集（evaluation set）的代码。","https:\u002F\u002Fgithub.com\u002FLiveCodeBench\u002FLiveCodeBench\u002Fissues\u002F8",[]]