[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-openai--prm800k":3,"tool-openai--prm800k":65},[4,18,28,36,44,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":24,"last_commit_at":25,"category_tags":26,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150720,2,"2026-04-11T11:33:10",[14,13,27],"语言模型",{"id":29,"name":30,"github_repo":31,"description_zh":32,"stars":33,"difficulty_score":10,"last_commit_at":34,"category_tags":35,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 
的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[27,15,13,14],{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":24,"last_commit_at":42,"category_tags":43,"status":17},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[14,27],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":24,"last_commit_at":50,"category_tags":51,"status":17},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 
将是理想的起点。",85092,"2026-04-10T11:13:16",[15,16,52,53,13,54,27,14,55],"视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":17},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[27,16,54],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":83,"stars":88,"forks":89,"last_commit_at":90,"license":91,"difficulty_score":62,"env_os":79,"env_gpu":92,"env_ram":92,"env_deps":93,"category_tags":98,"github_topics":80,"view_count":24,"oss_zip_url":80,"oss_zip_packed_at":80,"status":17,"created_at":99,"updated_at":100,"faqs":101,"releases":102},6640,"openai\u002Fprm800k","prm800k","800,000 step-level correctness labels on LLM solutions to MATH problems","prm800k 是一个专为提升大语言模型数学推理能力而设计的开源数据集。它包含了 80 万条针对 MATH 数据集中数学题解答的“步骤级”正确性标注。与传统仅判断最终答案对错的方式不同，prm800k 将解题过程拆解为多个步骤，并对每一步的逻辑正确性进行独立评估。\n\n这一设计有效解决了模型在复杂推理中常见的“一步错、步步错”难题。通过引入过程监督（Process Supervision），开发者可以训练模型更精准地识别中间推理错误，从而显著减少幻觉并提高解题准确率。其核心技术亮点在于细粒度的反馈机制，让模型不仅能知道结果错了，还能明确知道是哪一步推导出现了偏差。\n\nprm800k 非常适合 AI 
研究人员、大模型开发者以及专注于自然语言处理与逻辑推理领域的工程师使用。对于希望优化模型推理链条、探索新型训练范式（如强化学习中的奖励建模）的团队，这份提供了原始标注数据及详细标注指南的资源极具价值。虽然普通用户无法直接操作数据集，但未来基于此类技术优化的数学辅导工具或智能解题助手，将直接受益于 prm800k 带来的技术进步。","# PRM800K: A Process Supervision Dataset\n\n#### [[Blog Post]](https:\u002F\u002Fopenai.com\u002Fresearch\u002Fimproving-mathematical-reasoning-with-process-supervision) [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.20050)\n\nThis repository accompanies the paper [Let's Verify Step by Step](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.20050) and presents the PRM800K dataset introduced there. PRM800K is a process supervision dataset containing 800,000 step-level correctness labels for model-generated solutions to problems from the [MATH](https:\u002F\u002Fgithub.com\u002Fhendrycks\u002Fmath) dataset. More information on PRM800K and the project can be found in the paper.\n\nWe are releasing the raw labels as well as the instructions we gave labelers during phase 1 and phase 2 of the project. Example labels can be seen in the image below.\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopenai_prm800k_readme_4e1878785e24.png\" height=\"300\"\u002F>\n\u003C\u002Fp>\n\n\n## Data\n\nThe `data\u002F` folder contains our labels formatted as newline-delimited lists of `json` data. The data has been uploaded with [Git LFS](https:\u002F\u002Fgit-lfs.com\u002F), which you'll need to install in order to properly clone the repository.\n\nEach line represents 1 full solution sample and can contain many step-level labels. 
Here is one annotated line:\n\n\n```javascript\n{\n  \u002F\u002F UUID representing a particular labeler.\n  \"labeler\": \"340d89bc-f5b7-45e9-b272-909ba68ee363\",\n\n  \u002F\u002F The timestamp this trajectory was submitted.\n  \"timestamp\": \"2023-01-22T04:34:27.052924\",\n\n  \u002F\u002F In phase 2, we split our data collection into generations, using our best\n  \u002F\u002F PRM so far to pick which solutions to score in the next generation.\n  \u002F\u002F In phase 1, this value should always be null.\n  \"generation\": 9,\n\n  \u002F\u002F In each generation, we reserve some solutions for quality control. We serve\n  \u002F\u002F these solutions to every labeler, and check that they agree with our\n  \u002F\u002F gold labels.\n  \"is_quality_control_question\": false,\n\n  \u002F\u002F generation -1 was reserved for a set of 30 questions we served every\n  \u002F\u002F labeler in order to screen for base task performance.\n  \"is_initial_screening_question\": false,\n\n  \u002F\u002F Metadata about the question this solution is a response to.\n  \"question\": {\n    \u002F\u002F Text of the MATH problem being solved.\n    \"problem\": \"What is the greatest common factor of $20 !$ and $200,\\\\!000$?  (Reminder: If $n$ is a positive integer, then $n!$ stands for the product $1\\\\cdot 2\\\\cdot 3\\\\cdot \\\\cdots \\\\cdot (n-1)\\\\cdot n$.)\",\n    \u002F\u002F Ground truth solution from the MATH dataset.\n    \"ground_truth_solution\": \"The prime factorization of $200,000$ is $2^6 \\\\cdot 5^5$. Then count the number of factors of $2$ and $5$ in $20!$. Since there are $10$ even numbers, there are more than $6$ factors of $2$. There are $4$ factors of $5$. So the greatest common factor is $2^6 \\\\cdot 5^4=\\\\boxed{40,\\\\!000}$.\",\n    \u002F\u002F Ground truth answer.\n    \"ground_truth_answer\": \"40,\\\\!000\",\n\n    \u002F\u002F The full steps of the model-generated solution. 
This is only set for\n    \u002F\u002F phase 2 where we pre-generated all solutions that we labeled.\n    \"pre_generated_steps\": [\n      \"I want to find the largest positive integer that divides both $20 !$ and $200,\\\\!000$ evenly.\",\n      \"One way to do this is to factor both numbers into prime factors and look for the common ones.\",\n      \"I know that $200,\\\\!000 = 2^5\\\\cdot 10^4 = 2^9\\\\cdot 5^4$.\",\n      \"To find the prime factorization of $20 !$, I can use the fact that it is the product of all the positive integers from $1$ to $20$.\",\n      \"For each prime number $p$ between $1$ and $20$, I can count how many multiples of $p$ are in that range.\",\n      \"For example, there are $10$ multiples of $2$ between $1$ and $20$, namely $2, 4, 6, \\\\dots, 20$.\",\n      \"But there are also $5$ multiples of $4$, which is $2^2$, and $2$ multiples of $8$, which is $2^3$, and $1$ multiple of $16$, which is $2^4$.\",\n      \"So, the total power of $2$ in $20 !$ is $10 + 5 + 2 + 1 = 18$.\",\n      \"Similarly, there are $4$ multiples of $5$, namely $5, 10, 15, 20$, so the power of $5$ in $20 !$ is $4$.\",\n      \"There are $6$ multiples of $3$, namely $3, 6, 9, \\\\dots, 18$, but there are also $2$ multiples of $9$, which is $3^2$, so the power of $3$ in $20 !$ is $6 + 2 = 8$.\",\n      \"There are $2$ multiples of $7$, namely $7$ and $14$, so the power of $7$ in $20 !$ is $2$.\",\n      \"There are $1$ multiple of each of the other prime numbers $11, 13, 17$, and $19$, so the powers of those primes in $20 !$ are $1$ each.\",\n      \"Therefore, the prime factorization of $20 !$ is $2^{18}\\\\cdot 3^8\\\\cdot 5^4\\\\cdot 7^2\\\\cdot 11\\\\cdot 13\\\\cdot 17\\\\cdot 19$.\",\n      \"To find the greatest common factor of $20 !$ and $200,\\\\!000$, I need to take the lowest power of each common prime factor.\",\n      \"The only common prime factors are $2$ and $5$, and the lowest powers are $9$ and $4$, respectively.\",\n      \"So, the greatest 
common factor is $2^9\\\\cdot 5^4 = 512\\\\cdot 625 = 320,\\\\!000$.\\n\\n# Answer\\n\\n320,000\"\n    ],\n    \u002F\u002F The answer given as the end of the pre-generated solution. We can see\n    \u002F\u002F this solution is incorrect.\n    \"pre_generated_answer\": \"320,000\",\n    \u002F\u002F The score given by our PRM to this solution. This one isn't rated very\n    \u002F\u002F highly!\n    \"pre_generated_verifier_score\": 0.010779580529581414\n  },\n\n  \u002F\u002F The human data we collected for this solution, containing correctness\n  \u002F\u002F labels for each step of the solution.\n  \"label\": {\n    \"steps\": [\n      \u002F\u002F Each object here represents labels for one step of the solution.\n      {\n        \u002F\u002F Each step will contain one or more completions. These are candidate\n        \u002F\u002F steps the model output at this step of the trajectory. In phase 1,\n        \u002F\u002F we frequently collect labels on alternative steps, while in phase 2\n        \u002F\u002F we only collect labels on alternative steps after the first mistake,\n        \u002F\u002F so most completions lists are singletons.\n        \"completions\": [\n          {\n            \u002F\u002F Text of the step.\n            \"text\": \"I want to find the largest positive integer that divides both $20 !$ and $200,\\\\!000$ evenly.\",\n            \u002F\u002F The rating the labeler gave to this step. Can be -1, 0, or +1.\n            \u002F\u002F This is a 0 because it isn't incorrect, but it does not make\n            \u002F\u002F any progress.\n            \"rating\": 0,\n            \u002F\u002F The labeler can flag steps that they don't know how to label.\n            \u002F\u002F This is rarely used.\n            \"flagged\": null\n          }\n        ],\n        \u002F\u002F In phase 1, if all completions were rated -1, we allowed labelers to\n        \u002F\u002F write their own +1 step. 
This is null for all steps in phase 2.\n        \"human_completion\": null,\n        \u002F\u002F The index of the completion \"chosen\" at this step, or null if the\n        \u002F\u002F human_completion was used. You can reconstruct the solution\n        \u002F\u002F trajectory like:\n        \u002F\u002F [\n        \u002F\u002F     step[\"human_completion\"] if step[\"chosen_completion\"] is None\n        \u002F\u002F     else step[\"completions\"][step[\"chosen_completion\"]][\"text\"]\n        \u002F\u002F     for step in labeled_solution[\"label\"][\"steps\"]\n        \u002F\u002F ]\n        \"chosen_completion\": 0\n      },\n      {\n        \"completions\": [\n          {\n            \"text\": \"One way to do this is to factor both numbers into prime factors and look for the common ones.\",\n            \"rating\": 0,\n            \"flagged\": null\n          }\n        ],\n        \"human_completion\": null,\n        \"chosen_completion\": 0\n      },\n      {\n        \u002F\u002F Some steps contain multiple alternative completions, and each one\n        \u002F\u002F gets a rating.\n        \"completions\": [\n          {\n            \"text\": \"I know that $200,\\\\!000 = 2^5\\\\cdot 10^4 = 2^9\\\\cdot 5^4$.\",\n            \"rating\": -1,\n            \"flagged\": null\n          },\n          {\n            \"text\": \"To factor $20 !$, I can use the fact that every factorial is a multiple of every number less than or equal to it.\",\n            \"rating\": 0,\n            \"flagged\": false\n          },\n          {\n            \"text\": \"I can use a factor tree to find the prime factors of $200,\\\\!000$: $200,\\\\!000 = 2^5\\\\cdot 10^4 = 2^5\\\\cdot 2^4\\\\cdot 5^4 = 2^9\\\\cdot 5^4$.\",\n            \"rating\": -1,\n            \"flagged\": false\n          },\n          {\n            \"text\": \"I can use a factor tree to find the prime factors of $200,\\\\!000$.\",\n            \"rating\": 0,\n            \"flagged\": false\n          
},\n          {\n            \"text\": \"To factor $20 !$, I can use the fact that any factorial is divisible by all the primes less than or equal to the input.\",\n            \"rating\": 0,\n            \"flagged\": false\n          }\n        ],\n        \"human_completion\": null,\n        \"chosen_completion\": null\n      }\n    ],\n    \u002F\u002F Total time in milliseconds spent on labeling this solution.\n    \"total_time\": 278270,\n    \u002F\u002F Final result of labeling this solution. Will be one of:\n    \u002F\u002F   - \"found_error\": In phase 2 we stop labeling a solution after the\n    \u002F\u002F                    first error is found.\n    \u002F\u002F   - \"solution\": We reached a step that concluded in the correct answer\n    \u002F\u002F                 to the problem.\n    \u002F\u002F   - \"bad_problem\": The labeler reported the problem as broken.\n    \u002F\u002F   - \"give_up\": The labeler was stuck (the problem was taking too long,\n    \u002F\u002F                or the instructions were unclear) and moved onto the\n    \u002F\u002F                next problem.\n    \"finish_reason\": \"found_error\"\n  }\n}\n```\n\n\n## Instructions\n\nThe `instructions\u002F` folder contains the instructions documents we gave to\nlabelers during each phase of the project.\n\n\n## Answer Grading\n\nThe `grading\u002F` folder contains the python grading logic we used for determining if a model-outputted answer correctly matched\nthe ground truth answer in Hendrycks' MATH dataset. We build off of Hendrycks' math normalization logic in `math_normalize.py`\nand use sympy to check for equality of expressions in `grader.py`. We recommend using `grader.grade_answer(model_answer, gt_answer)`\nwhere both answers are strings to determine if a solution is correct or not.\n\nAnswer grading is difficult in general. 
This grading logic is designed to be conservative and will sometimes reject correct\nanswers, though it does so less frequently than the normalization logic from MATH. Our logic might sometimes admit incorrect\nanswers, though we've put effort into minimizing this.\n\n\n## MATH Splits\n\nAs explained in Let's Verify Step by Step, we use a nonstandard MATH train\u002Ftest split.\n\n> In order to avoid the risk of over-fitting on the 7,500 MATH training problems, we expanded the training set to include 4,500 MATH test split problems. We therefore evaluate our models only on the remaining 500 held-out problems. We selected these 500 test problems uniformly at random, and we believe they are representative of the test set as a whole.\n\nThe `math_splits\u002F` folder contains our selected splits in the `train.jsonl` and `test.jsonl` files. You'll need [Git LFS](https:\u002F\u002Fgit-lfs.com\u002F) to properly clone these files.\n\n\n## Scored Samples\n\nWe release all large-scale model samples used to evaluate the large-scale ORM and PRM, corresponding to Figure 3 in the paper. Each test problem has up to 1860 scored samples. Solutions that failed to reach an answer within 1024 tokens were discarded, resulting in fewer than 1860 samples on some problems. We account for this in the best-of-N evaluation logic. 
\n\nEvaluate the PRM:\n\n```bash\npython eval\u002Feval.py --method prm\n```\n\nEvaluate the ORM:\n\n```bash\npython eval\u002Feval.py --method orm\n```\n\n\n## Citation\n\nPlease use the below BibTeX entry to cite this dataset:\n\n```\n@article{lightman2023lets,\n      title={Let's Verify Step by Step}, \n      author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},\n      journal={arXiv preprint arXiv:2305.20050},\n      year={2023}\n}\n```\n","# PRM800K：一个过程监督数据集\n\n#### [[博客文章]](https:\u002F\u002Fopenai.com\u002Fresearch\u002Fimproving-mathematical-reasoning-with-process-supervision) [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.20050)\n\n本仓库配合论文《让我们逐步验证》（Let's Verify Step by Step）[https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.20050]，展示了其中介绍的 PRM800K 数据集。PRM800K 是一个过程监督数据集，包含针对来自 [MATH](https:\u002F\u002Fgithub.com\u002Fhendrycks\u002Fmath) 数据集问题的模型生成解答的 80 万条步骤级正确性标签。有关 PRM800K 及该项目的更多信息，请参阅该论文。\n\n我们同时发布了原始标签，以及在项目第一阶段和第二阶段中提供给标注人员的说明。示例标签可见下图。\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopenai_prm800k_readme_4e1878785e24.png\" height=\"300\"\u002F>\n\u003C\u002Fp>\n\n\n## 数据\n\n`data\u002F` 文件夹中包含了以换行符分隔的 `json` 数据列表格式的标签。这些数据已通过 [Git LFS](https:\u002F\u002Fgit-lfs.com\u002F) 上传，您需要安装 Git LFS 才能正确克隆该仓库。\n\n每行代表一个完整的解答样本，可能包含多条步骤级标签。以下是一条带注释的示例行：\n\n\n```javascript\n{\n  \u002F\u002F 唯一标识符，代表特定的标注者。\n  \"labeler\": \"340d89bc-f5b7-45e9-b272-909ba68ee363\",\n\n  \u002F\u002F 该轨迹提交的时间戳。\n  \"timestamp\": \"2023-01-22T04:34:27.052924\",\n\n  \u002F\u002F 在第二阶段，我们将数据收集分为多个世代，利用目前表现最好的 PRM 来挑选下一世代需要评分的解答。\n  \u002F\u002F 在第一阶段，此值应始终为 null。\n  \"generation\": 9,\n\n  \u002F\u002F 在每个世代中，我们会预留一部分解答用于质量控制。这些解答会提供给每位标注者，并与我们的黄金标准标签进行比对。\n  \"is_quality_control_question\": false,\n\n  \u002F\u002F 第 -1 世代专门用于一组 30 道题目，我们将其提供给每位标注者，以筛选其基础任务的表现。\n  
\"is_initial_screening_question\": false,\n\n  \u002F\u002F 关于此解答所回应问题的元数据。\n  \"question\": {\n    \u002F\u002F 正在求解的 MATH 题目文本。\n    \"problem\": \"$20 !$ 和 $200,\\\\!000$ 的最大公约数是多少？（提示：若 $n$ 是正整数，则 $n!$ 表示乘积 $1\\\\cdot 2\\\\cdot 3\\\\cdot \\\\cdots \\\\cdot (n-1)\\\\cdot n$。）\",\n    \u002F\u002F MATH 数据集中提供的正确答案。\n    \"ground_truth_solution\": \"因为 $200,\\\\!000$ 的质因数分解为 $2^6 \\\\cdot 5^5$。接下来统计 $20 !$ 中 $2$ 和 $5$ 的指数。由于有 10 个偶数，$2$ 的指数必然大于 6。而 $5$ 的指数为 4。因此，最大公约数为 $2^6 \\\\cdot 5^4=\\\\boxed{40,\\\\!000}$。\",\n    \u002F\u002F 正确答案。\n    \"ground_truth_answer\": \"40,\\\\!000\",\n\n    \u002F\u002F 模型生成解答的完整步骤。此字段仅在第二阶段设置，因为我们预先生成了所有要标注的解答。\n    \"pre_generated_steps\": [\n      \"我想找到能够同时整除 $20 !$ 和 $200,\\\\!000$ 的最大正整数。\",\n      \"一种方法是将这两个数分别分解成质因数，然后找出它们共有的质因数。\",\n      \"我知道 $200,\\\\!000 = 2^5\\\\cdot 10^4 = 2^9\\\\cdot 5^4$。\",\n      \"为了求出 $20 !$ 的质因数分解，我可以利用它是由 1 到 20 的所有正整数相乘得到的事实。\",\n      \"对于 1 到 20 之间的每个质数 $p$，我可以统计该范围内有多少个 $p$ 的倍数。\",\n      \"例如，1 到 20 之间有 10 个 2 的倍数，分别是 $2, 4, 6, \\\\dots, 20$。\",\n      \"但同时也有 5 个 4 的倍数，即 $2^2$；还有 2 个 8 的倍数，即 $2^3$；以及 1 个 16 的倍数，即 $2^4$。\",\n      \"因此，$20 !$ 中 $2$ 的总指数是 $10 + 5 + 2 + 1 = 18$。\",\n      \"类似地，1 到 20 之间有 4 个 5 的倍数，分别是 $5, 10, 15, 20$，所以 $20 !$ 中 $5$ 的指数是 4。\",\n      \"此外，还有 6 个 3 的倍数，分别是 $3, 6, 9, \\\\dots, 18$，但同时也有 2 个 9 的倍数，即 $3^2$，因此 $20 !$ 中 $3$ 的指数是 $6 + 2 = 8$。\",\n      \"还有 2 个 7 的倍数，分别是 $7$ 和 $14$，所以 $20 !$ 中 $7$ 的指数是 2。\",\n      \"而对于其他质数 $11, 13, 17$ 和 $19$，每个只有一处倍数，因此它们在 $20 !$ 中的指数都是 1。\",\n      \"综上所述，$20 !$ 的质因数分解为 $2^{18}\\\\cdot 3^8\\\\cdot 5^4\\\\cdot 7^2\\\\cdot 11\\\\cdot 13\\\\cdot 17\\\\cdot 19$。\",\n      \"要找到 $20 !$ 和 $200,\\\\!000$ 的最大公约数，我需要取它们共有的质因数中指数最小的那个。\",\n      \"两者共有的质因数只有 $2$ 和 $5$，而它们的最小指数分别是 $9$ 和 $4$。\",\n      \"因此，最大公约数就是 $2^9\\\\cdot 5^4 = 512\\\\cdot 625 = 320,\\\\!000$。\\n\\n# 答案\\n\\n320,000\"\n    ],\n    \u002F\u002F 预先生成解答的最终答案。我们可以看到这个答案是错误的。\n    \"pre_generated_answer\": \"320,000\",\n    \u002F\u002F 我们的 PRM 给出的对该解答的评分。这个得分非常低！\n  
  \"pre_generated_verifier_score\": 0.010779580529581414\n  },\n\n\u002F\u002F 我们为该解决方案收集的人工数据，包含每个步骤的正确性标签。\n  \"label\": {\n    \"steps\": [\n      \u002F\u002F 每个对象代表解决方案中一个步骤的标签。\n      {\n        \u002F\u002F 每个步骤会包含一个或多个补全。这些是模型在该轨迹步骤中生成的候选步骤。在第一阶段，我们经常收集替代步骤的标签；而在第二阶段，我们只在第一次错误之后才收集替代步骤的标签，因此大多数补全列表通常只有一个条目。\n        \"completions\": [\n          {\n            \u002F\u002F 步骤的文本内容。\n            \"text\": \"我想找到能同时整除 $20 !$ 和 $200,\\\\!000$ 的最大正整数。\",\n            \u002F\u002F 标注者对该步骤的评分，取值为 -1、0 或 +1。\n            \u002F\u002F 这里评分为 0，因为该步骤没有错误，但也没有取得任何进展。\n            \"rating\": 0,\n            \u002F\u002F 标注者可以标记他们不确定如何标注的步骤。这种情况很少使用。\n            \"flagged\": null\n          }\n        ],\n        \u002F\u002F 在第一阶段，如果所有补全的评分均为 -1，我们会允许标注者自行写出一个评分为 +1 的步骤。但在第二阶段，所有步骤的 human_completion 均为 null。\n        \"human_completion\": null,\n        \u002F\u002F 该步骤“选择”的补全索引，若使用了 human_completion，则为 null。可以通过以下方式重建解决方案轨迹：\n        \u002F\u002F [\n        \u002F\u002F     step[\"human_completion\"] 如果 step[\"chosen_completion\"] 为 None\n        \u002F\u002F     否则 step[\"completions\"][step[\"chosen_completion\"]][\"text\"]\n        \u002F\u002F     对于 labeled_solution[\"label\"][\"steps\"] 中的每一个 step\n        \u002F\u002F ]\n        \"chosen_completion\": 0\n      },\n      {\n        \"completions\": [\n          {\n            \"text\": \"一种方法是将这两个数分解成质因数，然后找出它们的公共质因数。\",\n            \"rating\": 0,\n            \"flagged\": null\n          }\n        ],\n        \"human_completion\": null,\n        \"chosen_completion\": 0\n      },\n      {\n        \u002F\u002F 有些步骤包含多个替代补全，每个补全都会被单独评分。\n        \"completions\": [\n          {\n            \"text\": \"我知道 $200,\\\\!000 = 2^5\\\\cdot 10^4 = 2^9\\\\cdot 5^4$。\",\n            \"rating\": -1,\n            \"flagged\": null\n          },\n          {\n            \"text\": \"为了分解 $20 !$，我可以利用这样一个事实：任何阶乘都是其小于或等于该数的所有整数的倍数。\",\n            \"rating\": 0,\n            \"flagged\": false\n          },\n          {\n  
          \"text\": \"我可以用因数树来找出 $200,\\\\!000$ 的质因数：$200,\\\\!000 = 2^5\\\\cdot 10^4 = 2^5\\\\cdot 2^4\\\\cdot 5^4 = 2^9\\\\cdot 5^4$。\",\n            \"rating\": -1,\n            \"flagged\": false\n          },\n          {\n            \"text\": \"我可以用因数树来找出 $200,\\\\!000$ 的质因数。\",\n            \"rating\": 0,\n            \"flagged\": false\n          },\n          {\n            \"text\": \"为了分解 $20 !$，我可以利用这样一个事实：任何阶乘都能被小于或等于输入数的所有质数整除。\",\n            \"rating\": 0,\n            \"flagged\": false\n          }\n        ],\n        \"human_completion\": null,\n        \"chosen_completion\": null\n      }\n    ],\n    \u002F\u002F 标注该解决方案所花费的总时间（以毫秒为单位）。\n    \"total_time\": 278270,\n    \u002F\u002F 标注该解决方案的最终结果。可能的取值包括：\n    \u002F\u002F   - \"found_error\": 在第二阶段，一旦发现第一个错误，我们就停止对该解决方案的标注。\n    \u002F\u002F   - \"solution\": 我们到达了一个得出问题正确答案的步骤。\n    \u002F\u002F   - \"bad_problem\": 标注者报告该问题存在缺陷。\n    \u002F\u002F   - \"give_up\": 标注者遇到困难（问题耗时过长或说明不清晰），于是转到下一个问题。\n    \"finish_reason\": \"found_error\"\n  }\n}\n```\n\n\n\n\n## 使用说明\n\n`instructions\u002F` 文件夹包含了我们在项目各阶段提供给标注者的操作指南文档。\n\n\n## 答案评分\n\n`grading\u002F` 文件夹包含用于判断模型输出的答案是否与 Hendrycks 的 MATH 数据集中的标准答案一致的 Python 评分逻辑。我们基于 `math_normalize.py` 中的 Hendrycks 数学标准化逻辑，并借助 sympy 库在 `grader.py` 中检查表达式的等价性。建议使用 `grader.grade_answer(model_answer, gt_answer)` 方法，其中两个参数均为字符串，以确定解题过程是否正确。\n\n总体而言，答案评分是一项复杂的工作。本评分逻辑设计较为保守，有时可能会误判正确的答案，不过这种误判的发生频率低于 MATH 数据集中的标准化逻辑。此外，我们的逻辑也可能偶尔接受错误的答案，但我们已尽力减少此类情况的发生。\n\n\n## MATH 数据集划分\n\n如《让我们逐步验证》所述，我们采用了非标准的 MATH 数据集训练集和测试集划分。\n\n> 为了避免对 MATH 训练集中的 7,500 道题目过度拟合的风险，我们将训练集扩展至包含 4,500 道 MATH 测试集题目。因此，我们仅在剩余的 500 道保留题目上评估模型性能。这 500 道测试题是我们随机均匀抽取的，我们认为它们能够代表整个测试集的特性。\n\n`math_splits\u002F` 文件夹中包含了我们选定的划分结果，分别存储在 `train.jsonl` 和 `test.jsonl` 文件中。要正确克隆这些文件，您需要使用 [Git LFS](https:\u002F\u002Fgit-lfs.com\u002F)。\n\n\n## 打分样本\n\n我们公开了用于评估大规模 ORM 和 PRM 模型的所有大规模模型样本，这些样本对应于论文中的图 3。每道测试题都有 1,860 个打分样本。对于那些在 1,024 个 token 内未能得出答案的解题过程，我们将其舍弃，因此部分题目上的有效样本数量不足 1,860。我们在最佳 N 
次采样评估逻辑中考虑了这一因素。\n\n评估 PRM 模型：\n\n```bash\npython eval\u002Feval.py --method prm\n```\n\n评估 ORM 模型：\n\n```bash\npython eval\u002Feval.py --method orm\n```\n\n\n## 引用\n\n请使用以下 BibTeX 条目引用本数据集：\n\n```\n@article{lightman2023lets,\n      title={Let's Verify Step by Step}, \n      author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},\n      journal={arXiv preprint arXiv:2305.20050},\n      year={2023}\n}\n```","# PRM800K 快速上手指南\n\nPRM800K 是一个过程监督数据集，包含 80 万条针对模型生成的数学问题解答的步骤级正确性标签。本指南帮助开发者快速配置环境并加载数据。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: Linux, macOS 或 Windows (WSL 推荐)\n- **Python**: 3.8 或更高版本\n- **Git**: 必须安装\n- **Git LFS**: **必须安装**，因为数据集文件通过 Git LFS 存储，未安装会导致克隆后文件仅为指针文本。\n\n### 前置依赖\n建议创建虚拟环境并安装基础依赖：\n\n```bash\npython -m venv venv\nsource venv\u002Fbin\u002Factivate  # Windows 用户请使用: venv\\Scripts\\activate\n\npip install --upgrade pip\npip install jsonlines sympy\n```\n\n> **国内加速提示**: 推荐使用清华或阿里镜像源加速 pip 安装：\n> ```bash\n> pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple jsonlines sympy\n> ```\n\n## 安装步骤\n\n### 1. 安装 Git LFS\n在克隆仓库前，请确保已安装并初始化 Git LFS。\n\n**Ubuntu\u002FDebian:**\n```bash\nsudo apt-get update\nsudo apt-get install git-lfs\ngit lfs install\n```\n\n**macOS (Homebrew):**\n```bash\nbrew install git-lfs\ngit lfs install\n```\n\n**Windows:**\n下载官方安装包或使用 `winget`:\n```powershell\nwinget install GitHub.GitLFS\ngit lfs install\n```\n\n### 2. 
克隆仓库\n执行以下命令克隆项目并拉取大文件数据：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fopenai\u002Fprm800k.git\ncd prm800k\n# 如果克隆时未自动拉取大文件，手动执行以下命令\ngit lfs pull\n```\n\n> **注意**: 数据文件较大，请确保网络连接稳定。如遇下载缓慢，可尝试配置 Git 代理。\n\n## 基本使用\n\n### 加载与解析数据\n数据位于 `data\u002F` 目录下，格式为换行分隔的 JSON (`jsonl`)。每一行代表一个完整的解答样本，包含多个步骤的标签。\n\n以下是使用 Python 读取并解析单个样本的示例：\n\n```python\nimport json\n\n# 替换为实际的数据文件路径，例如 data\u002Fphase1.jsonl 或 data\u002Fphase2.jsonl\ndata_file = \"data\u002Fphase1.jsonl\" \n\nwith open(data_file, \"r\", encoding=\"utf-8\") as f:\n    # 读取第一行作为示例\n    line = f.readline()\n    sample = json.loads(line)\n\n# 访问问题内容\nproblem_text = sample[\"question\"][\"problem\"]\nprint(f\"问题: {problem_text}\")\n\n# 访问人工标注的步骤信息\nsteps = sample[\"label\"][\"steps\"]\nprint(f\"步骤数量: {len(steps)}\")\n\n# 遍历步骤，查看第一个步骤的评分\nif steps:\n    first_step = steps[0]\n    # 获取被选中的完成项 (chosen_completion)\n    if first_step[\"chosen_completion\"] is not None:\n        completion = first_step[\"completions\"][first_step[\"chosen_completion\"]]\n        print(f\"步骤内容: {completion['text']}\")\n        print(f\"评分 (-1, 0, 1): {completion['rating']}\")\n    elif first_step[\"human_completion\"]:\n        print(f\"人工修正步骤: {first_step['human_completion']['text']}\")\n```\n\n### 答案验证工具\n项目提供了用于判断模型输出答案是否与标准答案一致的评分逻辑（基于 SymPy）。\n\n```python\nfrom grading.grader import grade_answer\n\nmodel_answer = \"40,000\"\ngt_answer = \"40,\\\\!000\" # 来自数据集的标准答案格式\n\nis_correct = grade_answer(model_answer, gt_answer)\nprint(f\"答案是否正确：{is_correct}\")\n```\n\n### 评估脚本\n若需复现论文中的 ORM 或 PRM 评估结果，可直接运行提供的评估脚本：\n\n```bash\n# 评估 PRM\npython eval\u002Feval.py --method prm\n\n# 评估 ORM\npython eval\u002Feval.py --method orm\n```","某教育科技团队正在开发一款针对高中生的 AI 数学辅导助手，旨在不仅给出答案，更要提供逻辑严密的逐步解题过程。\n\n### 没有 prm800k 时\n- **错误难以定位**：模型常出现“步骤正确但结论错误”或“中间一步出错导致全盘皆输”的情况，开发者只能依赖最终答案对错来训练，无法精准识别具体哪一步推理断裂。\n- **幻觉隐蔽性强**：在长链条推导中，模型容易编造看似合理的虚假公式（如错误的质因数分解），传统监督方式难以在生成过程中及时拦截这些细微的逻辑幻觉。\n- 
**迭代效率低下**：优化模型需要人工逐行检查大量解题样本以标注错误步骤，耗时耗力且标准不一，导致模型在复杂数学推理上的进步极其缓慢。\n\n### 使用 prm800k 后\n- **实现过程级监督**：利用 prm800k 提供的 80 万条步骤级正确性标签，团队可以直接训练奖励模型对每一步推导进行打分，精准定位并修正逻辑断层。\n- **有效抑制幻觉**：通过引入数据集中人类标注的细粒度反馈，模型学会了在每一步自我验证（如核对质因数计数），显著减少了中间步骤的胡编乱造现象。\n- **加速模型进化**：基于高质量的过程监督数据，团队无需重复造轮子进行人工标注，快速构建出能“步步为营”的高可靠性数学求解器，大幅缩短研发周期。\n\nprm800k 的核心价值在于将 AI 数学能力的评估维度从粗糙的“结果导向”升级为精细的“过程导向”，让模型真正学会像人类一样严谨推理。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopenai_prm800k_4e187878.png","openai","OpenAI","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fopenai_1960bbf4.png","",null,"https:\u002F\u002Fopenai.com\u002F","https:\u002F\u002Fgithub.com\u002Fopenai",[84],{"name":85,"color":86,"percentage":87},"Python","#3572A5",100,2114,126,"2026-04-10T12:28:10","MIT","未说明",{"notes":94,"python":92,"dependencies":95},"该项目主要是一个数据集仓库，而非直接运行的模型代码库。必须安装 Git LFS 才能正确克隆包含数据文件（data\u002F）和数学题目分割文件（math_splits\u002F）的仓库。仓库中包含用于答案评分的 Python 逻辑（grading\u002F），依赖 sympy 库进行表达式相等性检查。评估脚本（eval\u002Feval.py）用于评估 PRM 和 ORM，但具体的深度学习框架依赖（如 PyTorch）未在 README 中明确列出，需参考原论文或相关模型实现。",[96,97],"Git LFS","sympy",[27,16],"2026-03-27T02:49:30.150509","2026-04-11T23:23:13.699971",[],[]]