[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-hijkzzz--Awesome-LLM-Strawberry":3,"tool-hijkzzz--Awesome-LLM-Strawberry":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":76,"owner_location":76,"owner_email":79,"owner_twitter":76,"owner_website":80,"owner_url":81,"languages":76,"stars":82,"forks":83,"last_commit_at":84,"license":85,"difficulty_score":86,"env_os":87,"env_gpu":88,"env_ram":88,"env_deps":89,"category_tags":92,"github_topics":93,"view_count":10,"oss_zip_url":76,"oss_zip_packed_at":76,"status":16,"created_at":102,"updated_at":103,"faqs":104,"releases":105},3306,"hijkzzz\u002FAwesome-LLM-Strawberry","Awesome-LLM-Strawberry","A collection of LLM papers, blogs, and projects, with a focus on OpenAI o1 🍓 and reasoning techniques.","Awesome-LLM-Strawberry 是一个专注于大语言模型推理能力的开源资源合集，核心围绕 OpenAI o1（代号 Strawberry）及各类前沿推理技术展开。它系统性地整理了相关的学术论文、技术博客、官方文档以及复现项目，旨在帮助从业者快速掌握从基础理论到最新架构突破的全貌。\n\n面对大模型领域推理技术迭代极快、信息分散且难以追踪的痛点，Awesome-LLM-Strawberry 提供了持续更新的“一站式”导航。它不仅收录了 OpenAI 关于 o1、o3 系列的官方解读，还涵盖了 Google DeepMind、DeepSeek、月之暗面等机构在推理模型上的最新进展，甚至包括对 o1 架构的深度逆向工程分析和强化学习训练技巧探讨。\n\n该资源库特别适合 AI 研究人员、算法工程师以及对大模型底层机制感兴趣的开发者使用。无论是希望复现 o1 推理能力的团队，还是想要了解“思维链”、“过程监督”等独特技术亮点的学者，都能从中找到高价值的参考依据。通过汇聚全球顶尖的智慧成果，Awesome-LLM-Strawberry 成为了探索下一代具备深度思考能力 AI 系统的重要窗口。","  # Awesome LLM Strawberry (OpenAI 
o1)\n[![Awesome](https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fsindresorhus\u002Fawesome) ![GitHub stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhijkzzz\u002FAwesome-LLM-Strawberry?color=yellow) ![GitHub forks](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fhijkzzz\u002FAwesome-LLM-Strawberry?color=9cf) [![GitHub license](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fhijkzzz\u002FAwesome-LLM-Strawberry)](https:\u002F\u002Fgithub.com\u002Fhijkzzz\u002FAwesome-LLM-Strawberry\u002Fblob\u002Fmain\u002FLICENSE)\n\nThis is a collection of research papers & blogs for **OpenAI Strawberry (o1) and Reasoning**.\n\nThe repository will be continuously updated to track the frontier of LLM Reasoning.\n\n## OpenAI Docs\n- [https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fguides\u002Freasoning](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fguides\u002Freasoning)\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_015542bbebcd.png\" width=\"600px\">\n\n## News\n- [OpenAI] [Introducing deep research](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-deep-research\u002F)\n- [OpenAI] [o3 preview & o3 mini](https:\u002F\u002Fopenai.com\u002F12-days\u002F)\n- [OpenAI] [Introducing ChatGPT Pro](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-chatgpt-pro\u002F)\n- [Google DeepMind] [Gemini 2.0 Flash Thinking](https:\u002F\u002Fx.com\u002FJeffDean\u002Fstatus\u002F1869789813232341267)\n- [Ilya Sutskever] [AI with reasoning power will be less predictable](https:\u002F\u002Fwww.reuters.com\u002Ftechnology\u002Fartificial-intelligence\u002Fai-with-reasoning-power-will-be-less-predictable-ilya-sutskever-says-2024-12-14\u002F)\n- [SemiAnalysis] [Scaling Laws – O1 Pro Architecture, Reasoning Training Infrastructure, Orion and Claude 
3.5 Opus “Failures” ](https:\u002F\u002Fsemianalysis.com\u002F2024\u002F12\u002F11\u002Fscaling-laws-o1-pro-architecture-reasoning-training-infrastructure-orion-and-claude-3-5-opus-failures\u002F)\n- [DeepSeek] [DeepSeek-R1-Lite-Preview is now live: unleashing supercharged reasoning power!](https:\u002F\u002Fapi-docs.deepseek.com\u002Fnews\u002Fnews1120)\n- [Moonshot] [数学对标o1系列，搜索再次进化，Kimi 新推理模型与你一起拓展智能边界](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002Fg4DltigncX-4sfaQ6Qn1zA)\n- [Moonshot] [Kimi 发布视觉思考模型 k1，多项理科测试行业领先](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002F8cip3dehL8OIfZSnbZ1ftQ)\n- [InternLM] [强推理模型书生InternThinker开放体验：自主生成高智力密度数据、具备元动作思考能力](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002Fl7fdHlETvhKgZmUl23EiRA)\n- [新智元] [万字独家爆光，首揭o1 pro架构！惊人反转，Claude 3.5 Opus没失败？](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FLozJEE1sAAYAOrEFDVb6mg)\n\n## Blogs\n- [OpenAI] [Learning to Reason with LLMs](https:\u002F\u002Fopenai.com\u002Findex\u002Flearning-to-reason-with-llms\u002F)\n- [OpenAI] [OpenAI o1-mini Advancing cost-efficient reasoning](https:\u002F\u002Fopenai.com\u002Findex\u002Fopenai-o1-mini-advancing-cost-efficient-reasoning)\n- [OpenAI] [Finding GPT-4’s mistakes with GPT-4](https:\u002F\u002Fopenai.com\u002Findex\u002Ffinding-gpt4s-mistakes-with-gpt-4\u002F)\n- [ARC-AGI] [OpenAI o3 Breakthrough High Score on ARC-AGI-Pub](https:\u002F\u002Farcprize.org\u002Fblog\u002Foai-o3-pub-breakthrough)\n- [Anthropic] [Building effective agents](https:\u002F\u002Fwww.anthropic.com\u002Fresearch\u002Fbuilding-effective-agents)\n- [hijkzzz] [Stabilizing MoE RL Without Router Replay: The Online IcePop Solution](https:\u002F\u002Fhijkzzz.notion.site\u002Fonline-ice-pop)\n- [hijkzzz] [REINFORCE++-baseline is all you need in RLVR](https:\u002F\u002Fmedium.com\u002F@janhu9527\u002Freinforce-baseline-is-all-you-need-in-rlvr-f5406930aa85)\n- [hijkzzz] [Exploring OpenAI O1 Model 
Replication](https:\u002F\u002Fhijkzzz.notion.site\u002Fexploring-openai-o1-model-replication?pvs=74)\n- [Nathan Lambert] [OpenAI’s Strawberry, LM self-talk, inference scaling laws, and spending more on inference](https:\u002F\u002Fwww.interconnects.ai\u002Fp\u002Fopenai-strawberry-and-inference-scaling-laws)\n- [Nathan Lambert] [Reverse engineering OpenAI’s o1](https:\u002F\u002Fwww.interconnects.ai\u002Fp\u002Freverse-engineering-openai-o1)\n- [Andreas Stuhlmüller, jungofthewon] [Supervise Process, not Outcomes](https:\u002F\u002Fwww.alignmentforum.org\u002Fposts\u002FpYcFPMBtQveAjcSfH\u002Fsupervise-process-not-outcomes)\n- [Nouha Dziri] [Have o1 Models Cracked Human Reasoning?](https:\u002F\u002Fsubstack.com\u002Fhome\u002Fpost\u002Fp-148782195)\n- [Rishabh Agarwal] [Improving LLM Reasoning using Self-generated data: RL and Verifiers](https:\u002F\u002Frosanneliu.com\u002Fdlctfs\u002Fdlct_240531.pdf)\n- [Wei Shen] [Generalization Progress in RLHF: Insights into the Impact of Reward Models and PPO](https:\u002F\u002Fswtheking.notion.site\u002F4e0cbb325aaf458da710f0b36dbb239c?v=c9231e8c988b4d66a1d2dc34df4cf7b5)\n- [Dominater069] [Codeforces - Analyzing how good O1-Mini actually is](https:\u002F\u002Fcodeforces.com\u002Fblog\u002Fentry\u002F133887)\n- [Tibor Blaho] [Summary of what we have learned during AMA hour with the OpenAI o1 team](https:\u002F\u002Ftwitter-thread.com\u002Ft\u002F1834686946846597281)\n\n## Talks\n- [Noam Brown] [Parables on the Power of Planning in AI: From Poker to Diplomacy](https:\u002F\u002Fwww.youtube.com\u002Fwatch?app=desktop&v=eaAonE58sLU)\n- [Noam Brown] [OpenAI o1 and Teaching LLMs to Reason Better](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=jPluSXJpdrA&t=1669s)\n- [Hyung Won Chung] [Don't teach. 
Incentivize.](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=kYWUEV_e2ss)\n\n## Courses\n- [DeepLearning.AI] [Reasoning with o1](https:\u002F\u002Flearn.deeplearning.ai\u002Fcourses\u002Freasoning-with-o1)\n\n## Twitter\n\u003Cdetails>\n\u003Csummary>OpenAI Developers\u003C\u002Fsummary>\n\n- [All the questions addressed by the API team during the December 17, 2024 AMA](https:\u002F\u002Fcommunity.openai.com\u002Ft\u002Fall-the-questions-addressed-by-the-api-team-during-the-december-17-2024-ama\u002F1059780)\n- [We’re hosting an AMA for developers from 10–11 AM PT today.](https:\u002F\u002Fx.com\u002FOpenAIDevs\u002Fstatus\u002F1834608585151594537)\n- [Today we previewed Reinforcement Fine-Tuning](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1865136373491208674)\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_1a61f3b92629.png\" width=\"360px\">\n\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Noam Brown\u003C\u002Fsummary>\n\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_732ef0b2fde5.png\" width=\"360px\">\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_b1bd53563fa7.png\" width=\"360px\">\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_4d37fb0e6b0d.png\" width=\"360px\">\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_52aef3db34db.png\" width=\"360px\">\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Jason Wei\u003C\u002Fsummary>\n\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_a8b45242ddf7.png\" width=\"360px\">\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_216361e31100.png\" width=\"360px\">\n- 
\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_0a290c946a85.png\" width=\"360px\">\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Others\u003C\u002Fsummary>\n\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_2dab0b45f21b.png\" width=\"360px\">\n\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_05f1b4922682.png\" width=\"360px\">\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_64fcd129184a.png\" width=\"360px\">\n\n\u003C\u002Fdetails>\n\n## Open-source\n### Models\n- [Alibaba Qwen Team] [Qwen3](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3)\n- [Alibaba Qwen Team] [QwQ](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwQ-32B)\n- [Alibaba Qwen Team] [QvQ](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQVQ-72B-Preview)\n- [DeepSeek] [DeepSeek R1](https:\u002F\u002Fhuggingface.co\u002Fdeepseek-ai\u002FDeepSeek-R1)\n- [NVIDIA] [Nemotron-Research-Reasoning-Qwen-1.5B](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002FNemotron-Research-Reasoning-Qwen-1.5B)\n- [Skywork] [Skywork R1V2](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FSkywork\u002Fskywork-r1v2-68075a3d947a5ae160272671)\n- [rLLM] [DeepScaler](https:\u002F\u002Fgithub.com\u002Fagentica-project\u002Frllm)\n- [NovaSky] [Sky-T1](https:\u002F\u002Fgithub.com\u002FNovaSky-AI\u002FSkyThought)\n- [GAIR-NLP] [O1 Replication Journey: A Strategic Progress Report](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FO1-Journey)\n- [OpenO1 Team] [Open-Source O1](https:\u002F\u002Fopensource-o1.github.io\u002F)\n- [Tencent] [DRT-o1](https:\u002F\u002Fgithub.com\u002Fkrystalan\u002FDRT-o1)\n- [Alibaba] [Marco-o1](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FMarco-o1)\n- [CUHK-SZ] 
[HuatuoGPT-o1](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FHuatuoGPT-o1)\n\n### Codebase\n- [OpenRLHF Team] [OpenRLHF](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF)\n- [OpenRLHF Team] [REINFORCE++ | REINFORCE++-baseline](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F387487679_REINFORCE_An_Efficient_RLHF_Algorithm_with_Robustnessto_Both_Prompt_and_Reward_Models) | [Code](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_llama_ray_hybrid_engine.sh)\n- [NovaSky-AI] [SkyRL](https:\u002F\u002Fgithub.com\u002FNovaSky-AI\u002FSkyRL)\n- [RUCAIBox] [STILL: Slow Thinking with LLMs](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FSlow_Thinking_with_LLMs)\n- [HKUST] [Simple Reinforcement Learning for Reasoning](https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002FsimpleRL-reason)\n  - This is a replication of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data\n- [Ubiquant] [Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning](https:\u002F\u002Fgithub.com\u002FUnakar\u002FLogic-RL)\n- [StepFun] [Open-Reasoner-Zero](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero)\n- [TideDra] [LMM-R1](https:\u002F\u002Fgithub.com\u002FTideDra\u002Flmm-r1)\n- [ModalMinds] [MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning](https:\u002F\u002Fgithub.com\u002FModalMinds\u002FMM-EUREKA)\n- [R1-V Team] [R1-V](https:\u002F\u002Fgithub.com\u002FDeep-Agent\u002FR1-V)\n- [LLaMA-Factory Team] [EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FEasyR1)\n- [Alibaba] [ROLL](https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL) | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06122)\n- [Sea AI Lab] [Dr. 
GRPO](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Funderstand-r1-zero)\n- [Berkeley AI Research] [TinyZero](https:\u002F\u002Fgithub.com\u002FJiayi-Pan\u002FTinyZero)\n- [Maitrix.org] [LLM Reasoners](https:\u002F\u002Fgithub.com\u002Fmaitrix-org\u002Fllm-reasoners)\n\n## Papers\n\n```\nformat:\n- [title](paper link) [links]\n  - author1, author2, and author3...\n  - publisher\n  - code\n  - experimental environments and datasets\n```\n\n### Technical Report on o1 Models\n- [INTELLECT-3: Technical Report](https:\u002F\u002Fstorage.googleapis.com\u002Fintellect-3-paper\u002FINTELLECT_3_Technical_Report.pdf)\n  - Prime Intellect Team\n- [DeepSeek V3.2](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdeepseek-ai\u002FDeepSeek-V3.2\u002Fresolve\u002Fmaster\u002Fassets\u002Fpaper.pdf)\n  - DeepSeek AI\n- [MiMo-V2 Flash](https:\u002F\u002Fgithub.com\u002FXiaomiMiMo\u002FMiMo-V2-Flash\u002Fblob\u002Fmain\u002Fpaper.pdf)\n  - Xiaomi\n- [Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fnemotron\u002Ffiles\u002FNVIDIA-Nemotron-3-Nano-Technical-Report.pdf)\n  - NVIDIA\n- [Qwen3 Technical Report](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3\u002Fblob\u002Fmain\u002FQwen3_Technical_Report.pdf)\n  - Qwen Team\n- [Scaling Agents via Continual Pre-training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.13310)\n  - Qwen Team\n- [WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.13309)\n  - Qwen Team\n- [Towards General Agentic Intelligence via Environment Scaling](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.13311)\n  - Qwen Team\n- [Magistral](https:\u002F\u002Fmistral.ai\u002Fstatic\u002Fresearch\u002Fmagistral.pdf)\n  - Mistral AI\n- [ERNIE-4.5](https:\u002F\u002Fernie.baidu.com\u002Fblog\u002Fposts\u002Fernie4.5\u002F)\n  - Baidu\n- [LongCat 
Flash](https:\u002F\u002Fgithub.com\u002Fmeituan-longcat\u002FLongCat-Flash-Chat\u002Fblob\u002Fmain\u002Ftech_report.pdf)\n  - Meituan\n- [GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.06471)\n  - Zhipu AI\n- [Seed Thinking v1.5](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FSeed-Thinking-v1.5)\n  - Bytedance Seed\n- [DeepSeek-V3 Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.19437)\n  - DeepSeek\n- [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1\u002Fblob\u002Fmain\u002FDeepSeek_R1.pdf)\n  - DeepSeek AI\n- [DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-Prover-V2\u002Ftree\u002Fmain)\n  - DeepSeek AI\n- [MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13585)\n  - MiniMax\n- [Kimi k2: Open agentic intelligence](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-K2\u002Fblob\u002Fmain\u002Ftech_report.pdf)\n  - MoonShot\n- [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-k1.5)\n  - MoonShot\n- [KIMI-VL TECHNICAL REPORT](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-VL\u002Fblob\u002Fmain\u002FKimi-VL.pdf)\n  - MoonShot\n- [Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimina-Prover-Preview)\n  - MoonShot & Numina\n- [Llama-Nemotron: Efficient Reasoning Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00949)\n  - NVIDIA\n- [Skywork Open Reasoner 1 Technical Report](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.22312)\n  - Skywork\n\n\n### 2025\n- [Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for 
General-Purpose Reasoning Models](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fnemotron\u002Ffiles\u002FNVIDIA-Nemotron-3-Nano-Technical-Report.pdf)\n  - NVIDIA \n- [QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2512.12967)\n  - Qwen Team \n- [DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.22570v1)\n  - DeepSeek-AI \n- [Stabilizing Reinforcement Learning with LLMs: Formulation and Practices](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2512.01374)\n  - Qwen Team\n- [The Art of Scaling Reinforcement Learning Compute for LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.13786)\n  - Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, Rishabh Agarwal \n- [BroRL: Scaling Reinforcement Learning via Broadened Exploration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.01180)\n  - Jian Hu, Mingjie Liu, Ximing Lu, Fang Wu, Zaid Harchaoui, Shizhe Diao, Yejin Choi, Pavlo Molchanov, Jun Yang, Jan Kautz, Yi Dong\n- [Why Language Models Hallucinate](https:\u002F\u002Fopenai.com\u002Findex\u002Fwhy-language-models-hallucinate\u002F)\n  - OpenAI\n- [rStar2-Agent: Agentic Reasoning Technical Report](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2508.20722)\n  - Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, Mao Yang\n- [RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. 
Reinforcement Learning Fine-Tuning for LLMs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.16546v1#page=1.33)\n  - Hangzhan Jin, Sicheng, Sifan Wu, Mohammad Hamdaqa\n- [DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.14460)\n  - Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang\n- [Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2508.09726)\n  - Vaishnavi Shrivastava, Ahmed Awadallah, Vidhisha Balachandran, Shivam Garg, Harkirat Behl, Dimitris Papailiopoulos\n- [ProRL V2 - Prolonged Training Validates RL Scaling Laws](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Flpr\u002Fprorlv2\u002F)\n  - Jian Hu, Mingjie Liu, Shizhe Diao, Ximing Lu, Xin Dong, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong\n- [Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.08221)\n  - Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, Shengyi Huang, Siran Yang, Jiamang Wang, Wenbo Su, Bo Zheng\n- [Group Sequence Policy Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.18071)\n  - Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin\n- [Your Efficient RL Framework Secretly Brings You Off-Policy RL Training](https:\u002F\u002Ffengyao.notion.site\u002Foff-policy-rl)\n  - Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, Jianfeng Gao\n- [Gemini 2.5 Pro Capable of Winning Gold at IMO 2025](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.15855)\n  - Yichen Huang, Lin F. 
Yang\n- [DeepSWE: Training a Fully Open-sourced, State-of-the-Art Coding Agent by Scaling RL](https:\u002F\u002Fwww.together.ai\u002Fblog\u002Fdeepswe)\n  - together.ai\n- [OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.20512)\n  - Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu\n- [ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24864)\n  - Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong\n- [REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F387487679_REINFORCE_An_Efficient_RLHF_Algorithm_with_Robustnessto_Both_Prompt_and_Reward_Models)\n  - Jian Hu, Jason Klein Liu, Wei Shen\n  - Code: [REINFORCE++-baseline](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_llama_ray_hybrid_engine.sh)\n- [Beyond the 80\u002F20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01939)\n  - Qwen Team\n- [The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning](https:\u002F\u002Fwww.alphaxiv.org\u002Foverview\u002F2506.01347)\n  - Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, Yu Meng\n- [Thinking with Generated Images](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22525)\n  - Ethan Chern, Zhulin Hu, Steffi Chern, Siqi Kou, Jiadi Su, Yan Ma, Zhijie Deng, Pengfei Liu\n- [Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15966)\n  - Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, Wenhu Chen\n- [DeepEyes: Incentivizing \"Thinking with Images\" via Reinforcement 
Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14362)\n  - Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu\n- [Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05464)\n  - Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He\n- [QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17667)\n  - Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan\n- [Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07773)\n  - Xinji Mai, Haotian Xu, Xing W, Weinong Wang, Yingying Zhang, Wenqiang Zhang\n  - Code: [https:\u002F\u002Fgithub.com\u002Fyyht\u002Fopenrlhf_async_pipline](https:\u002F\u002Fgithub.com\u002Fyyht\u002Fopenrlhf_async_pipline)\n- [A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11343)\n  - Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong\n- [Reinforcement Learning for Reasoning in Large Language Models with One Training Example](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20571)\n  - Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen\n- [Process Reward Models That Think](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16828)\n  - Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang\n- [M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10449)\n 
 - Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao\n- [A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07086)\n  - Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, Matthias Bethge\n- [Concise Reasoning via Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.05185)\n  - Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, Kartik Talamadupula\n- [VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.05118)\n  - YuYue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, etc.\n- [Inference-Time Scaling for Generalist Reward Modeling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.02495)\n  - Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, Yu Wu\n- [JudgeLRM: Large Reasoning Models as a Judge](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00050)\n  - Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He\n- [DAPO: An Open-Source LLM Reinforcement Learning System at Scale](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.14476)\n  - Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, etc.\n- [Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05171)\n  - Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein\n- [Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07572)\n  - Yuxiao Qu, Matthew Y. R. 
Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, Aviral Kumar\n- [R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.05592)\n  - Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen\n- [Visual-RFT: Visual Reinforcement Fine-Tuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01785)\n  - Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang\n- [Introducing Visual Perception Token into Multimodal Large Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17425)\n  - Runpeng Yu, Xinyin Ma, Xinchao Wang\n- [Scaling Test-Time Compute Without Verification or RL is Suboptimal](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12118)\n  - Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar\n- [LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.07374)\n  - Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. 
Gonzalez, Ion Stoica\n- [Demystifying Long Chain-of-Thought Reasoning in LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03373)\n  - Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue\n- [LIMR: Less is More for RL Scaling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11886)\n  - Xuefeng Li, Haoyang Zou, Pengfei Liu\n- [LIMO: Less is More for Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03387)\n  - Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu\n- [s1: Simple test-time scaling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393)\n  - Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto\n- [SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17161v1)\n  - Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma\n- [Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.11651)\n  - Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, Yuxiao Dong\n- [Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.02508)\n  - Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan\n- [Distillation Quantification for Large Language Models](https:\u002F\u002Fgithub.com\u002FAegis1863\u002FLLMs-Distillation-Quantification\u002Fblob\u002Fmain\u002Fpaper.pdf)\n  - Sunbowen Lee, Junting Zhou, Chao Ao, etc.\n- [rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04519)\n  - Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, 
Youran Sun, Yi Zhu, Fan Yang, Mao Yang\n- [Evolving Deeper LLM Thinking](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.09891)\n  - Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, Xinyun Chen\n- [The Lessons of Developing Process Reward Models in Mathematical Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07301)\n  - Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin\n- [Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04682)\n  - Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn\n- [PRMBENCH: A Fine-grained and Challenging Benchmark for Process-Level Reward Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.03124)\n  - Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, Yu Cheng\n- [Virgo: A Preliminary Exploration on Reproducing o1-like MLLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01904)\n  - Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen\n- [Imagine while Reasoning in Space: Multimodal Visualization-of-Thought](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07542)\n  - Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, Furu Wei\n- [LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.06186)\n  - Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan\n- [From chaos to order: The atomic reasoner framework for fine-grained reasoning in large language 
models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.15944)\n  - Jinyi Liu, Yan Zheng, Rong Cheng, Qiyu Wu, Wei Guo, Fei Ni, Hebin Liang, Yifu Yuan, Hangyu Mao, Fuzheng Zhang, Jianye Hao\n\n### 2024\n- [Deliberative alignment: reasoning enables safer language models](https:\u002F\u002Fopenai.com\u002Findex\u002Fdeliberative-alignment\u002F)\n  - OpenAI\n- [MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07095)\n  - Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry\n- [From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.03590)\n  - Scott McKinney\n- [LLM Critics Help Catch LLM Bugs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.00215)\n  - Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, Jan Leike\n- [Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.10292)\n  - Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine\n- [ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.16044)\n  - Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin\n- [Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.15556)\n  - Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Dacheng Tao\n- [Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model 
Parameters](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.03314)\n  - Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar\n- [An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.00724)\n  - Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, Yiming Yang\n- [Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2408.16737)\n  - Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, Mehran Kazemi\n- [Large Language Monkeys: Scaling Inference Compute with Repeated Sampling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21787)\n  - Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini\n- [Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.09413)\n  - Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen\n- [Training Language Models to Self-Correct via Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12917)\n  - Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust\n- [Do NOT Think That Much for 2+3=? 
On the Overthinking of o1-Like LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.21187)\n  - Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu\n- [MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.19260)\n  - Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, Thomas Lin\n- [Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12122)\n  - An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, Zhenru Zhang\n- [Does RLHF Scale? Exploring the Impacts From Data, Model, and Method](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06000)\n  - Zhenyu Hou, Pengfan Du, Yilin Niu, Zhengxiao Du, Aohan Zeng, Xiao Liu, Minlie Huang, Hongning Wang, Jie Tang, Yuxiao Dong\n- [Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11504)\n  - Xinyan Guan, Yanjiang Liu, Xinyu Lu, Boxi Cao, Ben He, Xianpei Han, Le Sun, Jie Lou, Bowen Yu, Yaojie Lu, Hongyu Lin\n- [Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14135)\n  - Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Bo Wang, Shimin Li, Yunhua Zhou, Qipeng Guo, Xuanjing Huang, Xipeng Qiu\n- [Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09629)\n  - Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, Noah D. 
Goodman\n  - https:\u002F\u002Fgithub.com\u002Fezelikman\u002Fquiet-star\n- [Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.16579)\n  - Zhiheng Xi, Dingwen Yang, Jixuan Huang, Jiafu Tang, Guanyu Li, Yiwen Ding, Wei He, Boyang Hong, Shihan Do, Wenyu Zhan, Xiao Wang, Rui Zheng, Tao Ji, Xiaowei Shi, Yitao Zhai, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Zuxuan Wu, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Yu-Gang Jiang\n  - https:\u002F\u002Fmathcritique.github.io\u002F\n- [On Designing Effective RL Reward at Training Time for LLM Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15115)\n  - Jiaxuan Gao, Shusheng Xu, Wenjie Ye, Weilin Liu, Chuyi He, Wei Fu, Zhiyu Mei, Guangju Wang, Yi Wu\n- [Generative Verifiers: Reward Modeling as Next-Token Prediction](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.15240)\n  - Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal\n- [Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08146)\n  - Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar\n- [Improve Mathematical Reasoning in Language Models by Automated Process Supervision](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.06592)\n  - Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, Abhinav Rastogi\n- [Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08935)\n  - Peiyi Wang, Lei Li, Zhihong Shao, R.X. 
Xu, Damai Dai, Yifei Li, Deli Chen, Y.Wu, Zhifang Sui\n- [Planning In Natural Language Improves LLM Search For Code Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03733)\n  - Evan Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, Will Song, Vaskar Nath, Ziwen Han, Sean Hendryx, Summer Yue, Hugh Zhang\n- [PROCESSBENCH: Identifying Process Errors in Mathematical Reasoning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.06559)\n  - Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin\n- [AFlow: Automating Agentic Workflow Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10762)\n  - Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu\n- [Interpretable Contrastive Monte Carlo Tree Search Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01707)\n  - Zitian Gao, Boye Niu, Xuzheng He, Haotian Xu, Hongzhang Liu, Aiwei Liu, Xuming Hu, Lijie Wen\n- [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.07199)\n  - Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov\n- [Mixture-of-Agents Enhances Large Language Model Capabilities](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.04692)\n  - Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Zou\n- [Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.03271)\n  - Zhiyuan Hu, Chumin Liu, Xidong Feng, Yilun Zhao, See-Kiong Ng, Anh Tuan Luu, Junxian He, Pang Wei Koh, Bryan Hooi\n- [Advancing LLM Reasoning Generalists with Preference Trees](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.02078)\n  - Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan et al.\n- [Toward 
Self-Improvement of LLMs via Imagination, Searching, and Criticizing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.12253)\n  - Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu.\n- [AlphaMath Almost Zero: Process Supervision Without Process](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.03553)\n  - Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan.\n- [ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.03816)\n  - Dan Zhang, Sining Zhoubian, Yisong Yue, Yuxiao Dong, and Jie Tang.\n- [Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18319)\n  - Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, Dacheng Tao\n- [Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14432)\n  - Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu\n- [MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16265)\n  - Jikun Kang, Xin Zhe Li, Xi Chen, Amirreza Kazemi, Qianyi Sun, Boxing Chen, Dong Li, Xu He, Quan He, Feng Wen, Jianye Hao, Jun Yao.\n- [Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.00451)\n  - Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, Michael Shieh.\n- [When is Tree Search Useful for LLM Planning? 
It Depends on the Discriminator](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.10890)\n  - Ziru Chen, Michael White, Raymond Mooney, Ali Payani, Yu Su, Huan Sun\n- [Chain of Thought Empowers Transformers to Solve Inherently Serial Problems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12875)\n  - Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma.\n- [To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12183)\n  - Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett\n- [Do Large Language Models Latently Perform Multi-Hop Reasoning?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.16837)\n  - Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, Sebastian Riedel\n- [Chain-of-Thought Reasoning Without Prompting](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.10200)\n  - Xuezhi Wang, Denny Zhou\n- [Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.06195)\n  - Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, Mao Yang\n- [Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09136)\n  - Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin\n- [ReFT: Reasoning with Reinforced Fine-Tuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.08967)\n  - Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li\n- [VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01679)\n  - Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, Nicolas Le Roux\n- [Stream of Search (SoS): Learning to Search in Language](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03683)\n  - Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, 
Winson Cheng, Archit Sharma, Noah D. Goodman\n- [GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05229)\n  - Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar\n- [Evaluation of OpenAI o1: Opportunities and Challenges of AGI](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.18486)\n  - Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, Chao Cao, Hanqi Jiang, Hanxu Chen, Yiwei Li, Junhao Chen, etc.\n- [Evaluating LLMs at Detecting Errors in LLM Responses](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03602)\n  - Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, Xiaoxin Lu, Nan Zhang, Yusen Zhang, Ranran Haoran Zhang, Sujeeth Reddy Vummanthala, Salika Dave, Shaobo Qin, Arman Cohan, Wenpeng Yin, Rui Zhang\n- [On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.19924)\n  - Kevin Wang, Junbo Li, Neel P. Bhatt, Yihan Xi, Qiang Liu, Ufuk Topcu, Zhangyang Wang\n- [Not All LLM Reasoners Are Created Equal](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01748)\n  - Arian Hosseini, Alessandro Sordoni, Daniel Toyama, Aaron Courville, Rishabh Agarwal\n- [LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.13373)\n  - Karthik Valmeekam, Kaya Stechly, Subbarao Kambhampati\n- [A Comparative Study on Reasoning Patterns of OpenAI's o1 Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13639)\n  - Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, Qunshu Lin, Junbo Zhao, Zhaoxiang Zhang, Wenhao Huang, Ge Zhang, Chenghua Lin, J.H. 
Liu\n- [Thinking LLMs: General Instruction Following with Thought Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10630)\n  - Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar\n- [Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning Through Trap Problems](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.06680)\n  - Jun Zhao, Jingqi Tong, Yurong Mou, Ming Zhang, Qi Zhang, Xuanjing Huang\n- [V-STaR: Training Verifiers for Self-Taught Reasoners](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.06457)\n  - Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal\n- [CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.08642)\n  - Tianlong Wang, Junzhe Chen, Xueting Han, Jing Bai\n- [RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.02089)\n  - Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar\n- [Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.14283)\n  - Chaojie Wang, Yanchen Deng, Zhiyi Lyu, Liang Zeng, Jujie He, Shuicheng Yan, Bo An\n- [Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.16999)\n  - Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, Hongsheng Li\n### 2023\n- [Let's Verify Step by Step](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.20050)\n  - Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe\n- [V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14135)\n  - Penghao Wu, Saining Xie\n- 
[Training Chain-of-Thought via Latent-Variable Inference](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02179)\n  - Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous\n- [Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17179)\n  - Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, Jun Wang\n- [OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.09724)\n  - Fei Yu, Anningzhe Gao, Benyou Wang\n- [Reasoning with Language Model is Planning with World Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14992)\n  - Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, Zhiting Hu\n- [Don’t throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15028)\n  - Liu, Jiacheng, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz.\n- [Certified reasoning with language models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.04031)\n  - Gabriel Poesia, Kanishk Gandhi, Eric Zelikman, Noah D. 
Goodman\n- [Large Language Models Cannot Self-Correct Reasoning Yet](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01798)\n  - Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, Denny Zhou\n\n### 2022\n- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.11903)\n  - Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou\n- [Self-Consistency Improves Chain of Thought Reasoning in Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.11171)\n  - Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou\n- [Self-critiquing models for assisting human evaluators](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05802)\n  - William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike\n- [Chain of Thought Imitation with Procedure Cloning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.10816)\n  - Mengjiao Yang, Dale Schuurmans, Pieter Abbeel, Ofir Nachum.\n- [STaR: Bootstrapping Reasoning With Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.14465)\n  - Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. 
Goodman\n- [Solving math word problems with process- and outcome-based feedback](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.14275)\n  - Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, Irina Higgins\n\n### 2021\n- [Training Verifiers to Solve Math Word Problems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.14168)\n  - Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman\n- [Scalable Online Planning via Reinforcement Learning Fine-Tuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2109.15316)\n  - Arnaud Fickinger, Hengyuan Hu, Brandon Amos, Stuart Russell, Noam Brown.\n- [Scaling Scaling Laws with Board Games](http:\u002F\u002Farxiv.org\u002Fabs\u002F2104.03113)\n  - Andy L. Jones.\n- [Show Your Work: Scratchpads for Intermediate Computation with Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00114)\n  - Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, Augustus Odena\n\n### Before 2021\n- [Improving Policies via Search in Cooperative Partially Observable Games](https:\u002F\u002Farxiv.org\u002Fabs\u002F1912.02318)\n  - Adam Lerer, Hengyuan Hu, Jakob Foerster, Noam Brown.\n- [Generative Language Modeling for Automated Theorem Proving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.03393)\n  - Stanislas Polu, Ilya Sutskever\n- [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.01815v1)\n  - David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis.\n","# Awesome LLM Strawberry (OpenAI 
o1)\n[![Awesome](https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fsindresorhus\u002Fawesome) ![GitHub stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhijkzzz\u002FAwesome-LLM-Strawberry?color=yellow) ![GitHub forks](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fhijkzzz\u002FAwesome-LLM-Strawberry?color=9cf) [![GitHub license](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fhijkzzz\u002FAwesome-LLM-Strawberry)](https:\u002F\u002Fgithub.com\u002Fhijkzzz\u002FAwesome-LLM-Strawberry\u002Fblob\u002Fmain\u002FLICENSE)\n\nThis is a collection of research papers and blog posts on **OpenAI Strawberry (o1) and reasoning**.\n\nThe repository is continuously updated to track the frontier of LLM reasoning.\n\n## OpenAI Official Documentation\n- [https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fguides\u002Freasoning](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fguides\u002Freasoning)\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_015542bbebcd.png\" width=\"600px\">\n\n## News\n- [OpenAI] [Introducing deep research](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-deep-research\u002F)\n- [OpenAI] [o3 preview & o3-mini](https:\u002F\u002Fopenai.com\u002F12-days\u002F)\n- [OpenAI] [Introducing ChatGPT Pro](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-chatgpt-pro\u002F)\n- [Google DeepMind] [Gemini 2.0 Flash Thinking](https:\u002F\u002Fx.com\u002FJeffDean\u002Fstatus\u002F1869789813232341267)\n- [Ilya Sutskever] [AI with reasoning power will be less predictable](https:\u002F\u002Fwww.reuters.com\u002Ftechnology\u002Fartificial-intelligence\u002Fai-with-reasoning-power-will-be-less-predictable-ilya-sutskever-says-2024-12-14\u002F)\n- [SemiAnalysis] [Scaling Laws: O1 Pro Architecture, Reasoning Training Infrastructure, Orion and Claude 3.5 Opus “Failures”](https:\u002F\u002Fsemianalysis.com\u002F2024\u002F12\u002F11\u002Fscaling-laws-o1-pro-architecture-reasoning-training-infrastructure-orion-and-claude-3-5-opus-failures\u002F)\n- [DeepSeek] 
[DeepSeek-R1-Lite Preview is now live: unleashing supercharged reasoning power!](https:\u002F\u002Fapi-docs.deepseek.com\u002Fnews\u002Fnews1120)\n- [Moonshot] [Math on par with the o1 series and search evolved again: Kimi's new reasoning model expands the frontier of intelligence with you](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002Fg4DltigncX-4sfaQ6Qn1zA)\n- [Moonshot] [Kimi releases the visual thinking model k1, leading the industry on multiple STEM benchmarks](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002F8cip3dehL8OIfZSnbZ1ftQ)\n- [InternLM] [The strong-reasoning model InternThinker opens for trial: it autonomously generates high-intelligence-density data and reasons with meta-actions](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002Fl7fdHlETvhKgZmUl23EiRA)\n- [新智元] [A 10,000-word exclusive first look at the o1 pro architecture! In a surprising twist, did Claude 3.5 Opus not fail after all?](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FLozJEE1sAAYAOrEFDVb6mg)\n\n## Blogs\n- [OpenAI] [Learning to reason with LLMs](https:\u002F\u002Fopenai.com\u002Findex\u002Flearning-to-reason-with-llms\u002F)\n- [OpenAI] [OpenAI o1-mini: advancing cost-efficient reasoning](https:\u002F\u002Fopenai.com\u002Findex\u002Fopenai-o1-mini-advancing-cost-efficient-reasoning)\n- [OpenAI] [Finding GPT-4's mistakes with GPT-4](https:\u002F\u002Fopenai.com\u002Findex\u002Ffinding-gpt4s-mistakes-with-gpt-4\u002F)\n- [ARC-AGI] [OpenAI o3 breakthrough high score on ARC-AGI-Pub](https:\u002F\u002Farcprize.org\u002Fblog\u002Foai-o3-pub-breakthrough)\n- [Anthropic] [Building effective agents](https:\u002F\u002Fwww.anthropic.com\u002Fresearch\u002Fbuilding-effective-agents)\n- [hijkzzz] [Stabilizing MoE RL without router replay: the online IcePop solution](https:\u002F\u002Fhijkzzz.notion.site\u002Fonline-ice-pop)\n- [hijkzzz] [REINFORCE++ baseline is all you need in RLVR](https:\u002F\u002Fmedium.com\u002F@janhu9527\u002Freinforce-baseline-is-all-you-need-in-rlvr-f5406930aa85)\n- [hijkzzz] [Exploring OpenAI O1 model replication](https:\u002F\u002Fhijkzzz.notion.site\u002Fexploring-openai-o1-model-replication?pvs=74)\n- [Nathan Lambert] [OpenAI's Strawberry, LM self-talk, inference scaling laws, and spending more on inference](https:\u002F\u002Fwww.interconnects.ai\u002Fp\u002Fopenai-strawberry-and-inference-scaling-laws)\n- [Nathan Lambert] [Reverse engineering OpenAI's o1](https:\u002F\u002Fwww.interconnects.ai\u002Fp\u002Freverse-engineering-openai-o1)\n- [Andreas Stuhlmüller, jungofthewon] 
[Supervise Process, not Outcomes](https:\u002F\u002Fwww.alignmentforum.org\u002Fposts\u002FpYcFPMBtQveAjcSfH\u002Fsupervise-process-not-outcomes)\n- [Nouha Dziri] [Have o1 models cracked human reasoning?](https:\u002F\u002Fsubstack.com\u002Fhome\u002Fpost\u002Fp-148782195)\n- [Rishabh Agarwal] [Improving LLM reasoning with auto-generated data: RL and verifiers](https:\u002F\u002Frosanneliu.com\u002Fdlctfs\u002Fdlct_240531.pdf)\n- [Wei Shen] [Advancing generalization in RLHF: insights into the impact of reward models and PPO](https:\u002F\u002Fswtheking.notion.site\u002F4e0cbb325aaf458da710f0b36dbb239c?v=c9231e8c988b4d66a1d2dc34df4cf7b5)\n- [Dominater069] [Codeforces - Analyzing how good O1-Mini actually is](https:\u002F\u002Fcodeforces.com\u002Fblog\u002Fentry\u002F133887)\n- [Tibor Blaho] [Summary of what we learned during the hour-long AMA with the OpenAI o1 team](https:\u002F\u002Ftwitter-thread.com\u002Ft\u002F1834686946846597281)\n\n## Talks\n- [Noam Brown] [Parables on the power of planning in AI: from poker to diplomacy](https:\u002F\u002Fwww.youtube.com\u002Fwatch?app=desktop&v=eaAonE58sLU)\n- [Noam Brown] [OpenAI o1 and teaching LLMs to reason better](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=jPluSXJpdrA&t=1669s)\n- [Hyung Won Chung] [Don't teach. Incentivize.](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=kYWUEV_e2ss)\n\n## Courses\n- [DeepLearning.AI] [Reasoning with o1](https:\u002F\u002Flearn.deeplearning.ai\u002Fcourses\u002Freasoning-with-o1)\n\n## Twitter\n\u003Cdetails>\n\u003Csummary>OpenAI Developers\u003C\u002Fsummary>\n\n- [All the questions addressed by the API team during the December 17, 2024 AMA](https:\u002F\u002Fcommunity.openai.com\u002Ft\u002Fall-the-questions-addressed-by-the-api-team-during-the-december-17-2024-ama\u002F1059780)\n- [We're hosting an AMA for developers from 10 to 11 am today.](https:\u002F\u002Fx.com\u002FOpenAIDevs\u002Fstatus\u002F1834608585151594537)\n- [Today we previewed Reinforcement Fine-Tuning](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1865136373491208674)\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_1a61f3b92629.png\" width=\"360px\">\n\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Noam Brown\u003C\u002Fsummary>\n\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_732ef0b2fde5.png\" 
width=\"360px\">\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_b1bd53563fa7.png\" width=\"360px\">\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_4d37fb0e6b0d.png\" width=\"360px\">\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_52aef3db34db.png\" width=\"360px\">\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Jason Wei\u003C\u002Fsummary>\n\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_a8b45242ddf7.png\" width=\"360px\">\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_216361e31100.png\" width=\"360px\">\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_0a290c946a85.png\" width=\"360px\">\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Others\u003C\u002Fsummary>\n\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_2dab0b45f21b.png\" width=\"360px\">\n\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_05f1b4922682.png\" width=\"360px\">\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_readme_64fcd129184a.png\" width=\"360px\">\n\n\u003C\u002Fdetails>\n\n## Open-Source Projects\n\n### Models\n- [Alibaba Qwen Team] [Qwen3](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3)\n- [Alibaba Qwen Team] [QwQ](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwQ-32B)\n- [Alibaba Qwen Team] [QvQ](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQVQ-72B-Preview)\n- [DeepSeek] [DeepSeek R1](https:\u002F\u002Fhuggingface.co\u002Fdeepseek-ai\u002FDeepSeek-R1)\n- [NVIDIA] 
[Nemotron-Research-Reasoning-Qwen-1.5B](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002FNemotron-Research-Reasoning-Qwen-1.5B)\n- [Skywork] [Skywork R1V2](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FSkywork\u002Fskywork-r1v2-68075a3d947a5ae160272671)\n- [rLLM] [DeepScaler](https:\u002F\u002Fgithub.com\u002Fagentica-project\u002Frllm)\n- [NovaSky] [Sky-T1](https:\u002F\u002Fgithub.com\u002FNovaSky-AI\u002FSkyThought)\n- [GAIR-NLP] [O1 Replication Journey: A Strategic Progress Report](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FO1-Journey)\n- [OpenO1 Team] [Open O1](https:\u002F\u002Fopensource-o1.github.io\u002F)\n- [Tencent] [DRT-o1](https:\u002F\u002Fgithub.com\u002Fkrystalan\u002FDRT-o1)\n- [Alibaba] [Marco-o1](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FMarco-o1)\n- [CUHK-Shenzhen] [HuatuoGPT-o1](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FHuatuoGPT-o1)\n\n### Codebases\n- [OpenRLHF Team] [OpenRLHF](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF)\n- [OpenRLHF Team] [REINFORCE++ | REINFORCE++ baseline](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F387487679_REINFORCE_An_Efficient_RLHF_Algorithm_with_Robustnessto_Both_Prompt_and_Reward_Models) | [code](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_llama_ray_hybrid_engine.sh)\n- [NovaSky-AI] [SkyRL](https:\u002F\u002Fgithub.com\u002FNovaSky-AI\u002FSkyRL)\n- [RUCAIBox] [STILL: Slow Thinking with LLMs](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FSlow_Thinking_with_LLMs)\n- [HKUST] [Simple Reinforcement Learning for Reasoning](https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002FsimpleRL-reason)\n  - A replication of DeepSeek-R1-Zero, and of training DeepSeek-R1 on small models with limited data.\n- [Ubiquant] [Logic-RL: Unleashing LLM reasoning with rule-based reinforcement learning](https:\u002F\u002Fgithub.com\u002FUnakar\u002FLogic-RL)\n- [StepFun] [Open-Reasoner-Zero](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero)\n- [TideDra] [LMM-R1](https:\u002F\u002Fgithub.com\u002FTideDra\u002Flmm-r1)\n- [ModalMinds] 
[MM-EUREKA：通过大规模规则强化学习探索视觉顿悟时刻](https:\u002F\u002Fgithub.com\u002FModalMinds\u002FMM-EUREKA)\n- [R1-V 团队] [R1-V](https:\u002F\u002Fgithub.com\u002FDeep-Agent\u002FR1-V)\n- [LLaMA-Factory 团队] [EasyR1：高效、可扩展、多模态的 RL 训练框架](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FEasyR1)\n- [阿里巴巴] [ROLL](https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL) | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06122)\n- [Sea AI Lab] [Dr. GRPO](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Funderstand-r1-zero)\n- [伯克利人工智能研究组] [TinyZero](https:\u002F\u002Fgithub.com\u002FJiayi-Pan\u002FTinyZero)\n- [Maitrix.org] [LLM 推理器](https:\u002F\u002Fgithub.com\u002Fmaitrix-org\u002Fllm-reasoners)\n\n## 论文\n\n```\n格式：\n- [标题](论文链接) [链接]\n  - 作者1、作者2、作者3…\n  - 出版社\n  - 代码\n  - 实验环境和数据集\n```\n\n### 关于 o1 模型的技术报告\n- [INTELLECT-3：技术报告](https:\u002F\u002Fstorage.googleapis.com\u002Fintellect-3-paper\u002FINTELLECT_3_Technical_Report.pdf)\n  - Prime Intellect 团队\n- [DeepSeek V3.2](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdeepseek-ai\u002FDeepSeek-V3.2\u002Fresolve\u002Fmaster\u002Fassets\u002Fpaper.pdf)\n  - DeepSeek AI\n- [MiMo-V2 Flash](https:\u002F\u002Fgithub.com\u002FXiaomiMiMo\u002FMiMo-V2-Flash\u002Fblob\u002Fmain\u002Fpaper.pdf)\n  - 小米\n- [Nemotron 3 Nano：面向代理式推理的开放、高效的混合专家混合 Mamba-Transformer 模型](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fnemotron\u002Ffiles\u002FNVIDIA-Nemotron-3-Nano-Technical-Report.pdf)\n  - NVIDIA\n- [Qwen3 技术报告](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3\u002Fblob\u002Fmain\u002FQwen3_Technical_Report.pdf)\n  - 通义实验室\n- [通过持续预训练扩展智能体](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.13310)\n  - 通义实验室\n- [WebResearcher：释放长时程智能体的无限推理能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.13309)\n  - 通义实验室\n- [通过环境扩展迈向通用代理智能](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.13311)\n  - 通义实验室\n- [Magistral](https:\u002F\u002Fmistral.ai\u002Fstatic\u002Fresearch\u002Fmagistral.pdf)\n  - Mistral AI\n- 
[ERNIE-4.5](https:\u002F\u002Fernie.baidu.com\u002Fblog\u002Fposts\u002Fernie4.5\u002F)\n  - 百度\n- [LongCat Flash](https:\u002F\u002Fgithub.com\u002Fmeituan-longcat\u002FLongCat-Flash-Chat\u002Fblob\u002Fmain\u002Ftech_report.pdf)\n  - 美团\n- [GLM-4.5：代理式、推理与编程（ARC）基础模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.06471)\n  - 智谱 AI\n- [Seed Thinking v1.5](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FSeed-Thinking-v1.5)\n  - 字节跳动 Seed\n- [DeepSeek-V3 技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.19437)\n  - DeepSeek\n- [DeepSeek-R1：通过强化学习激励 LLM 的推理能力](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1\u002Fblob\u002Fmain\u002FDeepSeek_R1.pdf)\n  - DeepSeek AI\n- [DeepSeek-Prover-V2：通过强化学习进行子目标分解以推进形式数学推理](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-Prover-V2\u002Ftree\u002Fmain)\n  - DeepSeek AI\n- [MiniMax-M1：利用闪电注意力高效扩展推理时计算能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13585)\n  - MiniMax\n- [Kimi k2：开放的代理智能](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-K2\u002Fblob\u002Fmain\u002Ftech_report.pdf)\n  - MoonShot\n- [Kimi k1.5：用 LLM 扩展强化学习](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-k1.5)\n  - MoonShot\n- [KIMI-VL 技术报告](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-VL\u002Fblob\u002Fmain\u002FKimi-VL.pdf)\n  - MoonShot\n- [Kimina-Prover 预览：迈向大型形式推理模型的强化学习](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimina-Prover-Preview)\n  - MoonShot 和 Numina\n- [Llama-Nemotron：高效推理模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00949)\n  - NVIDIA\n- [Skywork Open Reasoner 1 技术报告](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.22312)\n  - Skywork\n\n### 2025年\n- [Nemotron-Cascade：面向通用推理模型的级联强化学习扩展](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fnemotron\u002Ffiles\u002FNVIDIA-Nemotron-3-Nano-Technical-Report.pdf)\n  - 英伟达\n- [QwenLong-L1.5：长上下文推理与记忆的后训练配方](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2512.12967)\n  - 通义实验室团队\n- 
[DeepSeekMath-V2：迈向自我验证的数学推理](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.22570v1)\n  - DeepSeek-AI\n- [利用大语言模型稳定强化学习：方法与实践](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2512.01374)\n  - 通义实验室团队\n- [为大语言模型扩展强化学习计算资源的艺术](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.13786)\n  - Devvrit Khatri、Lovish Madaan、Rishabh Tiwari、Rachit Bansal、Sai Surya Duvvuri、Manzil Zaheer、Inderjit S. Dhillon、David Brandfonbrener、Rishabh Agarwal\n- [BroRL：通过拓宽探索范围实现强化学习规模化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.01180)\n  - 刘健、刘明杰、卢希明、吴芳、扎伊德·哈尔乔伊、刁世哲、崔艺珍、帕夫洛·莫尔恰诺夫、杨俊、扬·考茨、董毅\n- [为什么语言模型会出现幻觉](https:\u002F\u002Fopenai.com\u002Findex\u002Fwhy-language-models-hallucinate\u002F)\n  - OpenAI\n- [rStar2-Agent：代理式推理技术报告](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2508.20722)\n  - 尚宁、刘一飞、朱毅、张丽琳娜、徐伟江、关鑫宇、张步泽、董炳成、周旭东、张博文、辛颖、苗子明、李斯嘉、杨帆、杨茂\n- [强化学习既非万能药，也非海市蜃楼：理解大语言模型的监督学习与强化学习微调](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.16546v1#page=1.33)\n  - 金航展、思诚、吴思凡、穆罕默德·哈姆达卡\n- [DuPO：通过双重偏好优化实现可靠的大语言模型自我验证](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.14460)\n  - 佘帅杰、鲍宇、陆宇、许陆、李涛、朱文浩、黄树坚、程善博、陆陆、王宇轩\n- [多采样以减少思考：用于简洁推理的分组过滤策略优化](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2508.09726)\n  - 施里瓦斯塔瓦、阿瓦达拉、巴拉昌德兰、加格、贝赫尔、帕派伊利奥普洛斯\n- [ProRL V2——长期训练验证强化学习缩放定律](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Flpr\u002Fprorlv2\u002F)\n  - 刘健、刘明杰、刁世哲、卢希明、董欣、帕夫洛·莫尔恰诺夫、崔艺珍、扬·考茨、董毅\n- [第一部分：技巧还是陷阱？深入探讨用于大语言模型推理的强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.08221)\n  - 刘子赫、刘家顺、何燕城、王伟勋、刘嘉恒、潘玲、胡鑫宇、熊绍攀、黄巨、刘健、黄圣义、杨思然、王佳芒、苏文博、郑博\n- [分组序列策略优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.18071)\n  - 郑楚杰、刘世轩、李明泽、陈雄辉、于博文、高昌、邓凯、刘玉琼、门锐、杨安、周京仁、林俊阳\n- [你的高效强化学习框架正悄悄为你带来离策略强化学习训练](https:\u002F\u002Ffengyao.notion.site\u002Foff-policy-rl)\n  - 姚峰、刘立源、张定怀、董承宇、尚景波、高建峰\n- [Gemini 2.5 Pro 有望在2025年国际数学奥林匹克竞赛中夺冠](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.15855)\n  - 黄一辰、杨林福\n- [DeepSWE：通过强化学习规模化训练一个完全开源的最先进编码代理](https:\u002F\u002Fwww.together.ai\u002Fblog\u002Fdeepswe)\n  - together.ai\n- 
[OctoThinker：中期激励推动强化学习规模化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.20512)\n  - 王增志、周凡、李雪峰、刘鹏飞\n- [ProRL：长期强化学习拓展大型语言模型的推理边界](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24864)\n  - 刘明杰、刁世哲、卢希明、刘健、董欣、崔艺珍、扬·考茨、董毅\n- [REINFORCE++：一种高效的RLHF算法，对提示和奖励模型均具有鲁棒性](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F387487679_REINFORCE_An_Efficient_RLHF_Algorithm_with_Robustnessto_Both_Prompt_and_Reward_Models)\n  - 刘健、刘杰森·克莱因、沈伟\n  - 代码：[REINFORCE++-baseline](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_llama_ray_hybrid_engine.sh)\n- [超越80\u002F20法则：高熵少数标记驱动大语言模型推理的有效强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01939)\n  - 通义实验室团队\n- [负强化在大语言模型推理中的惊人效果](https:\u002F\u002Fwww.alphaxiv.org\u002Foverview\u002F2506.01347)\n  - 朱新宇、夏孟州、魏哲沛、陈伟林、陈丹琪、孟宇\n- [借助生成图像进行思考](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22525)\n  - 切恩·伊森、胡竹林、切恩·斯特菲、寇思琪、苏嘉迪、马彦、邓志杰、刘鹏飞\n- [像素推理者：以好奇心驱动的强化学习激励像素空间推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15966)\n  - 苏亚历克斯、王浩哲、任伟明、林方振、陈文虎\n- [DeepEyes：通过强化学习激励“用图像思考”](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14362)\n  - 郑子威、杨迈克尔、洪杰克、赵晨晓、徐国海、杨乐、申超、于兴\n- [让视觉拥有推理能力：通过模型融合理解感知与推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05464)\n  - 陈诗琪、张静涵、朱彤瑶、刘伟、高思洋、熊淼、李曼玲、何俊贤\n- [QwenLong-L1：借助强化学习迈向长上下文大型推理模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17667)\n  - 万凡奇、沈伟洲、廖圣义、施英成、李晨亮、杨子怡、张继、黄飞、周京仁、严明\n- [代理强化学习缩放定律：具备自发代码执行能力的代理强化学习用于解数学题](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07773)\n  - 麦新吉、徐浩天、W星、王伟农、张莹莹、张文强\n  - 代码：[https:\u002F\u002Fgithub.com\u002Fyyht\u002Fopenrlhf_async_pipline](https:\u002F\u002Fgithub.com\u002Fyyht\u002Fopenrlhf_async_pipline)\n- [大语言模型推理的极简主义方法：从拒绝采样到Reinforce](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11343)\n  - 熊伟、姚家睿、徐雨慧、庞博、王雷、萨霍伊院长、李俊楠、蒋楠、张同、熊才明、董汉泽\n- [仅需一次训练样本即可实现大型语言模型的推理强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20571)\n  - 
王一平、杨青、曾志远、任利昂、刘卢卡斯、彭宝林、程浩、何学海、王宽、高建峰、陈伟珠、王书航、杜思明、沈业龙\n- [能够思考的过程奖励模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16828)\n  - 哈利法、阿加瓦尔、洛格斯瓦兰、金在谦、彭浩、李蒙泰、李洪洛克、王陆\n- [M1：迈向可扩展的测试时计算，采用Mamba推理模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10449)\n  - 王俊雄、李文丁、帕利奥塔、里特尔、拉什、陶三\n- [对语言模型推理进展的清醒审视：陷阱与可重复性的路径](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07086)\n  - 霍赫莱纳特、巴特纳加尔、乌丹达劳、阿尔巴尼、普拉布、贝特格\n- [通过强化学习实现简洁推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.05185)\n  - 法特米、拉菲、唐明杰、塔拉马杜普拉\n- [VAPO：高效且可靠的强化学习，适用于高级推理任务](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.05118)\n  - 余悦、袁玉凤、俞启英、左晓晨、朱若飞、徐文渊、陈家泽、王承义、范甜甜、杜正印、魏向鹏、刘高宏、刘俊才、刘玲君、林海斌、林志奇、马伯乐等\n- [通用奖励模型的推理时间缩放](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.02495)\n  - 刘子俊、王佩仪、徐润欣、马士荣、阮冲、李鹏、刘洋、吴宇\n- [JudgeLRM：将大型推理模型作为裁判](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00050)\n  - 陈诺、胡志远、邹清云、吴佳颖、王茜、胡布莱恩、何炳生\n- [DAPO：大规模开源的大语言模型强化学习系统](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.14476)\n  - 俞启英、张正、朱若飞、袁玉凤、左晓晨、余悦、范甜甜、刘高宏、刘玲君、刘欣、林海斌等\n- [利用潜在推理扩展测试时计算：递归深度方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05171)\n  - 盖平、麦克莱什、贾因、基尔亨鲍尔、辛格、巴托尔德森、凯尔库拉、巴特勒、戈德斯坦\n- [通过元强化微调优化测试时计算](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07572)\n  - 曲宇晓、杨马修、塞特卢尔、坦斯托尔、比钦、萨拉胡丁诺夫、库马尔\n- [R1-Searcher：通过强化学习激励大语言模型的搜索能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.05592)\n  - 宋华通、姜金浩、闵英倩、陈杰、陈志鹏、赵韦恩新、方雷、温继荣\n- [Visual-RFT：视觉强化微调](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01785)\n  - 刘子宇、孙泽依、臧宇航、董晓依、曹宇航、段浩东、林大华、王佳琪\n- [将视觉感知标记引入多模态大型语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17425)\n  - 于润鹏、马新寅、王新超\n- [不进行验证或强化学习而扩展测试时计算是次优的](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12118)\n  - 塞特卢尔、拉贾拉曼、列维、库马尔\n- [大语言模型可以轻松地从演示中学会推理：重要的是结构，而非内容！](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.07374)\n  - 李大成、曹世义、格里格斯、刘舒、莫湘溪、帕蒂尔、扎哈里亚、冈萨雷斯、斯托伊卡\n- [揭秘大语言模型中的长链式思维推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03373)\n  - 叶欧德、童宇轩、牛莫瑞、纽比格、岳翔\n- [LIMR：更少反而更多，适用于强化学习缩放](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11886)\n  - 
李雪峰、邹浩洋、刘鹏飞\n- [LIMO：更少反而更多，适用于推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03387)\n  - 叶新怡、黄震、肖杨、切恩·伊森、夏世杰、刘鹏飞\n- [s1：简单的测试时缩放](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393)\n  - 缪尼霍夫、杨子彤、史伟佳、李香丽莎、费费·李、哈吉希尔齐、泽特洛伊默、梁珀西、坎德斯、桥本达津诺里\n- [SFT记住细节，RL则泛化：基础模型后训练的比较研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17161v1)\n  - 楚天哲、翟月祥、杨纪韩、佟盛邦、谢赛宁、舒尔曼斯、黎光越、列维、马毅\n- [通过强化学习和推理缩放推进语言模型推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.11651)\n  - 侯振宇、吕欣、陆锐、张佳杰、李宇江、姚子俊、李娟子、唐杰、董宇晓\n- [Satori：基于行动—思维链的强化学习，通过自回归搜索增强大语言模型的推理能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.02508)\n  - 沈茂浩、曾广涛、齐振廷、洪章伟、陈真芳、魏陆、沃内尔、达斯、考克斯、甘创\n- [大型语言模型的蒸馏量化](https:\u002F\u002Fgithub.com\u002FAegis1863\u002FLLMs-Distillation-Quantification\u002Fblob\u002Fmain\u002Fpaper.pdf)\n  - 李孙博文、周俊婷、敖超等\n- [rStar-Math：小型大语言模型可通过自我进化式深度思考掌握数学推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04519)\n  - 关鑫宇、张丽琳娜、刘一飞、尚宁、孙友然、朱毅、杨帆、杨茂\n- [进化更深层次的大语言模型思维](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.09891)\n  - 李匡辉、费舍尔、吴岳华、马伍德、巴鲁贾、舒尔曼斯、陈新云\n- [在数学推理中开发过程奖励模型的经验教训](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07301)\n  - 张振儒、郑楚杰、吴阳镇、张培臣、林润基、于博文、刘大亨、周京仁、林俊阳\n- [迈向大语言模型中的系统2式推理：学习如何用元思维链思考](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04682)\n  - 谢向薇、斯奈尔、甘地、阿尔巴拉克、辛格、布拉格登、冯杜、拉斐尔、莱尔、马汉、卡斯特里卡托、弗兰肯、哈伯、芬恩\n- [PRMBENCH：针对过程级奖励模型的细粒度且具有挑战性的基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.03124)\n  - 宋明阳、苏兆晨、曲晓叶、周嘉伟、程宇\n- [Virgo：关于复现o1类MLLM的初步探索](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01904)\n  - 杜一凡、刘子康、李一凡、赵韦恩新、霍宇琪、王炳宁、陈伟鹏、刘郑、王中原、温继荣\n- [在思考空间中想象：多模态思维可视化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07542)\n  - 李成祖、吴文山、张焕宇、夏燕、毛绍光、李东、武利奇、魏富\n- [LlamaV-o1：重新思考大语言模型中的逐步视觉推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.06186)\n  - 塔瓦卡尔、迪桑纳亚克、莫雷、索卡尔、希克尔、阿赫桑、李宇豪、祖姆里、拉胡德、安维尔、乔拉卡尔、拉普捷夫、沙赫、汗、汗\n\n### 2024年\n- [审慎对齐：推理使语言模型更安全](https:\u002F\u002Fopenai.com\u002Findex\u002Fdeliberative-alignment\u002F)\n  - OpenAI\n- 
[MLE-bench：在机器学习工程领域评估机器学习智能体](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07095)\n  - Jun Shern Chan、Neil Chowdhury、Oliver Jaffe、James Aung、Dane Sherburn、Evan Mays、Giulio Starace、Kevin Liu、Leon Maksin、Tejal Patwardhan、Lilian Weng、Aleksander Mądry\n- [从Medprompt到o1：医疗挑战问题及其他场景下的运行时策略探索](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.03590)\n  - Scott McKinney\n- [LLM批评者助力捕捉LLM中的缺陷](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.00215)\n  - Nat McAleese、Rai Michael Pokorny、Juan Felipe Ceron Uribe、Evgenia Nitishinskaya、Maja Trebacz、Jan Leike\n- [通过强化学习将大型视觉—语言模型微调为决策智能体](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.10292)\n  - Yuexiang Zhai、Hao Bai、Zipeng Lin、Jiayi Pan、Shengbang Tong、Yifei Zhou、Alane Suhr、Saining Xie、Yann LeCun、Yi Ma、Sergey Levine\n- [ZoomEye：通过基于树的图像探索增强多模态LLM的人类式缩放能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.16044)\n  - Haozhan Shen、Kangjia Zhao、Tiancheng Zhao、Ruochen Xu、Zilun Zhang、Mingwei Zhu、Jianwei Yin\n- [分而治之再合并：一种无需训练的框架，用于提升多模态大型语言模型的高分辨率图像感知能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.15556)\n  - Wenbin Wang、Liang Ding、Minyan Zeng、Xiabin Zhou、Li Shen、Yong Luo、Dacheng Tao\n- [在LLM测试时以最优方式扩展计算资源，可能比单纯扩大模型参数更为有效](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.03314)\n  - Charlie Snell、Jaehoon Lee、Kelvin Xu、Aviral Kumar\n- [针对语言模型解决问题时计算最优推理的实证分析](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.00724)\n  - Yangzhen Wu、Zhiqing Sun、Shanda Li、Sean Welleck、Yiming Yang\n- [更小、更弱却更好：通过计算最优采样训练LLM推理模型](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2408.16737)\n  - Hritik Bansal、Arian Hosseini、Rishabh Agarwal、Vinh Q. Tran、Mehran Kazemi\n- [大型语言猴子：利用重复采样扩展推理计算能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21787)\n  - Bradley Brown、Jordan Juravsky、Ryan Ehrlich、Ronald Clark、Quoc V. 
Le、Christopher Ré、Azalia Mirhoseini\n- [模仿、探索与自我改进：慢思考推理系统的复现报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.09413)\n  - Yingqian Min、Zhipeng Chen、Jinhao Jiang、Jie Chen、Jia Deng、Yiwen Hu、Yiru Tang、Jiapeng Wang、Xiaoxue Cheng、Huatong Song、Wayne Xin Zhao、Zheng Liu、Zhongyuan Wang、Ji-Rong Wen\n- [通过强化学习训练语言模型进行自我修正](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12917)\n  - Aviral Kumar、Vincent Zhuang、Rishabh Agarwal、Yi Su、John D Co-Reyes、Avi Singh、Kate Baumli、Shariq Iqbal、Colton Bishop、Rebecca Roelofs、Lei M Zhang、Kay McKinney、Disha Shrivastava、Cosmin Paduraru、George Tucker、Doina Precup、Feryal Behbahani、Aleksandra Faust\n- [对于2+3=?这样的简单问题，请不要过度思考——关于o1类LLM的过度思考问题](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.21187)\n  - Xingyu Chen、Jiahao Xu、Tian Liang、Zhiwei He、Jianhui Pang、Dian Yu、Linfeng Song、Qiuzhi Liu、Mengfei Zhou、Zhuosheng Zhang、Rui Wang、Zhaopeng Tu、Haitao Mi、Dong Yu\n- [MEDEC：临床笔记中医学错误检测与纠正的基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.19260)\n  - Asma Ben Abacha、Wen-wai Yim、Yujuan Fu、Zhaoyi Sun、Meliha Yetisgen、Fei Xia、Thomas Lin\n- [Qwen2.5-Math技术报告：通过自我改进迈向数学专家模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12122)\n  - An Yang、Beichen Zhang、Binyuan Hui、Bofei Gao、Bowen Yu、Chengpeng Li、Dayiheng Liu、Jianhong Tu、Jingren Zhou、Junyang Lin、Keming Lu、Mingfeng Xue、Runji Lin、Tianyu Liu、Xingzhang Ren、Zhenru Zhang\n- [RLHF能规模化吗？数据、模型和方法的影响探究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06000)\n  - Zhenyu Hou、Pengfan Du、Yilin Niu、Zhengxiao Du、Aohan Zeng、Xiao Liu、Minlie Huang、Hongning Wang、Jie Tang、Yuxiao Dong\n- [搜索、验证与反馈：通过验证器工程迈向基础模型的下一代后训练范式](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11504)\n  - Xinyan Guan、Yanjiang Liu、Xinyu Lu、Boxi Cao、Ben He、Xianpei Han、Le Sun、Jie Lou、Bowen Yu、Yaojie Lu、Hongyu Lin\n- [搜索与学习的规模化：从强化学习视角复现o1的路线图](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14135)\n  - Zhiyuan Zeng、Qinyuan Cheng、Zhangyue Yin、Bo Wang、Shimin Li、Yunhua Zhou、Qipeng Guo、Xuanjing Huang、Xipeng Qiu\n- 
[Quiet-STaR：语言模型可以教会自己先思考再说话](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09629)\n  - Eric Zelikman、Georges Harik、Yijia Shao、Varuna Jayasiri、Nick Haber、Noah D. Goodman\n  - https:\u002F\u002Fgithub.com\u002Fezelikman\u002Fquiet-star\n- [通过带有测试时和训练时监督的批评模型提升LLM推理能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.16579)\n  - Zhiheng Xi、Dingwen Yang、Jixuan Huang、Jiafu Tang、Guanyu Li、Yiwen Ding、Wei He、Boyang Hong、Shihan Do、Wenyu Zhan、Xiao Wang、Rui Zheng、Tao Ji、Xiaowei Shi、Yitao Zhai、Rongxiang Weng、Jingang Wang、Xunliang Cai、Tao Gui、Zuxuan Wu、Qi Zhang、Xipeng Qiu、Xuanjing Huang、Yu-Gang Jiang\n  - https:\u002F\u002Fmathcritique.github.io\u002F\n- [关于为LLM推理设计有效的训练时RL奖励](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15115)\n  - Jiaxuan Gao、Shusheng Xu、Wenjie Ye、Weilin Liu、Chuyi He、Wei Fu、Zhiyu Mei、Guangju Wang、Yi Wu\n- [生成式验证器：将奖励建模视为下一个标记预测](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.15240)\n  - Lunjun Zhang、Arian Hosseini、Hritik Bansal、Mehran Kazemi、Aviral Kumar、Rishabh Agarwal\n- [奖励进展：为LLM推理扩展自动化流程验证器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08146)\n  - Amrith Setlur、Chirag Nagpal、Adam Fisch、Xinyang Geng、Jacob Eisenstein、Rishabh Agarwal、Alekh Agarwal、Jonathan Berant、Aviral Kumar\n- [通过自动化流程监督提升语言模型的数学推理能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.06592)\n  - Liangchen Luo、Yinxiao Liu、Rosanne Liu、Samrat Phatale、Harsh Lara、Yunxuan Li、Lei Shu、Yun Zhu、Lei Meng、Jiao Sun、Abhinav Rastogi\n- [Math-Shepherd：无需人工标注，逐步验证并强化LLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08935)\n  - Peiyi Wang、Lei Li、Zhihong Shao、R.X. 
Xu、Damai Dai、Yifei Li、Deli Chen、Y.Wu、Zhifang Sui\n- [自然语言规划提升LLM代码生成搜索能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03733)\n  - Evan Wang、Federico Cassano、Catherine Wu、Yunfeng Bai、Will Song、Vaskar Nath、Ziwen Han、Sean Hendryx、Summer Yue、Hugh Zhang\n- [PROCESSBENCH：识别数学推理中的过程性错误](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.06559)\n  - Chujie Zheng、Zhenru Zhang、Beichen Zhang、Runji Lin、Keming Lu、Bowen Yu、Dayiheng Liu、Jingren Zhou、Junyang Lin\n- [AFlow：自动化代理式工作流生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10762)\n  - Jiayi Zhang、Jinyu Xiang、Zhaoyang Yu、Fengwei Teng、Xionghui Chen、Jiaqi Chen、Mingchen Zhuge、Xin Cheng、Sirui Hong、Jinlin Wang、Bingnan Zheng、Bang Liu、Yuyu Luo、Chenglin Wu\n- [可解释的对比蒙特卡洛树搜索推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01707)\n  - Zitian Gao、Boye Niu、Xuzheng He、Haotian Xu、Hongzhang Liu、Aiwei Liu、Xuming Hu、Lijie Wen\n- [Agent Q：面向自主AI智能体的高级推理与学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.07199)\n  - Pranav Putta、Edmund Mills、Naman Garg、Sumeet Motwani、Chelsea Finn、Divyansh Garg、Rafael Rafailov\n- [混合代理增强大型语言模型能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.04692)\n  - Junlin Wang、Jue Wang、Ben Athiwaratkun、Ce Zhang、James Zou\n- [思维的不确定性：基于不确定性的规划提升大型语言模型的信息获取能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.03271)\n  - Zhiyuan Hu、Chumin Liu、Xidong Feng、Yilun Zhao、See-Kiong Ng、Anh Tuan Luu、Junxian He、Pang Wei Koh、Bryan Hooi\n- [借助偏好树推进LLM通用型推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.02078)\n  - Lifan Yuan、Ganqu Cui、Hanbin Wang、Ning Ding、Xingyao Wang、Jia Deng、Boji Shan等\n- [通过想象、搜索和批评实现LLM的自我改进](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.12253)\n  - Ye Tian、Baolin Peng、Linfeng Song、Lifeng Jin、Dian Yu、Haitao Mi和Dong Yu。\n- [AlphaMath几乎为零：无流程的流程监督](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.03553)\n  - Guoxin Chen、Minpeng Liao、Chengxi Li、Kai Fan。\n- [ReST-MCTS*：通过流程奖励引导的树搜索实现LLM自我训练](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.03816)\n  - Dan Zhang、Sining Zhoubian、Yisong Yue、Yuxiao Dong和Jie Tang。\n- 
[Mulberry：借助集体蒙特卡洛树搜索赋予MLLM类似o1的推理与反思能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18319)\n  - Huanjin Yao、Jiaxing Huang、Wenhao Wu、Jingyi Zhang、Yibo Wang、Shunyu Liu、Yingjie Wang、Yuxin Song、Haocheng Feng、Li Shen、Dacheng Tao\n- [Insight-V：利用多模态大型语言模型探索长链视觉推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14432)\n  - Yuhao Dong、Zuyan Liu、Hai-Long Sun、Jingkang Yang、Winston Hu、Yongming Rao、Ziwei Liu\n- [MindStar：在推理阶段增强预训练LLM的数学推理能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16265)\n  - Jikun Kang、Xin Zhe Li、Xi Chen、Amirreza Kazemi、Qianyi Sun、Boxing Chen、Dong Li、Xu He、Quan He、Feng Wen、Jianye Hao、Jun Yao。\n- [蒙特卡洛树搜索通过迭代偏好学习提升推理能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.00451)\n  - Yuxi Xie、Anirudh Goyal、Wenyue Zheng、Min-Yen Kan、Timothy P. Lillicrap、Kenji Kawaguchi、Michael Shieh。\n- [何时树搜索对LLM规划有用？这取决于判别器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.10890)\n  - Ziru Chen、Michael White、Raymond Mooney、Ali Payani、Yu Su、Huan Sun\n- [思维链使Transformer能够解决本质上串行的问题](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12875)\n  - Zhiyuan Li、Hong Liu、Denny Zhou、Tengyu Ma。\n- [是否使用思维链？思维链主要有助于数学和符号推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12183)\n  - Zayne Sprague、Fangcong Yin、Juan Diego Rodriguez、Dongwei Jiang、Manya Wadhwa、Prasann Singhal、Xinyu Zhao、Xi Ye、Kyle Mahowald、Greg Durrett\n- [大型语言模型是否会潜在地进行多跳推理？](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.16837)\n  - Sohee Yang、Elena Gribovskaya、Nora Kassner、Mor Geva、Sebastian Riedel\n- [无需提示的思维链推理](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.10200)\n  - Xuezhi Wang、Denny Zhou\n- [相互推理使小型LLM成为更强大的问题解决者](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.06195)\n  - Zhenting Qi、Mingyuan Ma、Jiahang Xu、Li Lyna Zhang、Fan Yang、Mao Yang\n- [偏好优化链：改进LLM中的思维链推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09136)\n  - Xuan Zhang、Chao Du、Tianyu Pang、Qian Liu、Wei Gao、Min Lin\n- [ReFT：通过强化微调进行推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.08967)\n  - Trung Quoc Luong、Xinbo Zhang、Zhanming Jie、Peng Sun、Xiaoran 
Jin、Hang Li\n- [VinePPO：通过精细化信用分配释放LLM推理的RL潜力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01679)\n  - Amirhossein Kazemnejad、Milad Aghajohari、Eva Portelance、Alessandro Sordoni、Siva Reddy、Aaron Courville、Nicolas Le Roux\n- [搜索流（SoS）：学习如何在语言中进行搜索](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03683)\n  - Kanishk Gandhi、Denise Lee、Gabriel Grand、Muxin Liu、Winson Cheng、Archit Sharma、Noah D. Goodman\n- [GSM-符号：理解大型语言模型数学推理的局限性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05229)\n  - Iman Mirzadeh、Keivan Alizadeh、Hooman Shahrokhi、Oncel Tuzel、Samy Bengio、Mehrdad Farajtabar\n- [OpenAI o1的评估：AGI的机遇与挑战](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.18486)\n  - Tianyang Zhong、Zhengliang Liu、Yi Pan、Yutong Zhang、Yifan Zhou、Shizhe Liang、Zihao Wu、Yanjun Lyu、Peng Shu、Xiaowei Yu、Chao Cao、Hanqi Jiang、Hanxu Chen、Yiwei Li、Junhao Chen等\n- [评估LLM检测自身响应中错误的能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03602)\n  - Ryo Kamoi、Sarkar Snigdha Sarathi Das、Renze Lou、Jihyun Janice Ahn、Yilun Zhao、Xiaoxin Lu、Nan Zhang、Yusen Zhang、Ranran Haoran Zhang、Sujeeth Reddy Vummanthala、Salika Dave、Shaobo Qin、Arman Cohan、Wenpeng Yin、Rui Zhang\n- [关于OpenAI的o1模型的规划能力：可行性、最优性和泛化能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.19924)\n  - Kevin Wang、Junbo Li、Neel P. Bhatt、Yihan Xi、Qiang Liu、Ufuk Topcu、Zhangyang Wang\n- [并非所有LLM推理者都生而平等](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01748)\n  - Arian Hosseini、Alessandro Sordoni、Daniel Toyama、Aaron Courville、Rishabh Agarwal\n- [LLM仍然无法规划；LRM呢？对OpenAI的o1在PlanBench上的初步评估](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.13373)\n  - Karthik Valmeekam、Kaya Stechly、Subbarao Kambhampati\n- [OpenAI的o1模型推理模式比较研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13639)\n  - Siwei Wu、Zhongyuan Peng、Xinrun Du、Tuney Zheng、Minghao Liu、Jialong Wu、Jiachen Ma、Yizhi Li、Jian Yang、Wangchunshu Zhou、Qunshu Lin、Junbo Zhao、Zhaoxiang Zhang、Wenhao Huang、Ge Zhang、Chenghua Lin、J.H. 
Liu\n- [思考型LLM：结合思维生成的通用指令遵循](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10630)\n  - Tianhao Wu、Janice Lan、Weizhe Yuan、Jiantao Jiao、Jason Weston、Sainbayar Sukhbaatar\n- [通过陷阱问题探索大型语言模型在数学推理中的组合性不足](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.06680)\n  - Jun Zhao、Jingqi Tong、Yurong Mou、Ming Zhang、Qi Zhang、Xuanjing Huang\n- [V-STaR：为自学型推理者培训验证器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.06457)\n  - Arian Hosseini、Xingdi Yuan、Nikolay Malkin、Aaron Courville、Alessandro Sordoni、Rishabh Agarwal\n- [CPL：关键计划步骤学习提升LLM在推理任务中的泛化能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.08642)\n  - Tianlong Wang、Junzhe Chen、Xuting Han、Jing Bai\n- [RLEF：通过强化学习将代码LLM扎根于执行反馈](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.02089)\n  - Tianhao Wu、Janice Lan、Weizhe Yuan、Jiantao Jiao、Jason Weston、Sainbayar Sukhbaatar\n- [Q*：通过审慎规划改进LLM的多步推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.14283)\n  - Chaojie Wang、Yanchen Deng、Zhiyi Lyu、Liang Zeng、Jujie He、Shuicheng Yan、Bo An\n- [视觉思维链：借助全面的数据集和基准测试推进多模态语言模型的思维链推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.16999)\n  - Hao Shao、Shengju Qian、Han Xiao、Guanglu Song、Zhuofan Zong、Letian Wang、Yu Liu、Hongsheng Li\n\n### 2023年\n- [让我们逐步验证](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.20050)\n  - Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe\n- [V*：引导式视觉搜索作为多模态大语言模型的核心机制](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14135)\n  - 吴鹏浩, 谢赛宁\n- [通过潜在变量推理训练思维链](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02179)\n  - 杜潘, 马修·D·霍夫曼, 大卫·多汉, 肖尔托·道格拉斯, 段安赫, 亚伦·帕里西, 帕维尔·绍佐夫, 查尔斯·萨顿, 沙拉德·维克拉姆, 里夫·A·索罗斯\n- [类似AlphaZero的树搜索可以指导大型语言模型的解码和训练](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17179)\n  - 冯锡东, 万子宇, 文慕宁, 斯蒂芬·马库斯·麦卡利尔, 温颖, 张伟楠, 王军\n- [OVM：面向数学推理规划的结果监督价值模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.09724)\n  - 于飞, 高安宁哲, 王本友\n- [利用语言模型进行推理就是使用世界模型进行规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14992)\n  - 郝世博, 顾毅, 马浩迪, 洪家华·乔舒亚, 王振, 王哲·黛西, 
胡志廷\n- [别丢掉你的价值模型！用价值引导的蒙特卡洛树搜索解码生成更优文本](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15028)\n  - 刘、贾成、安德鲁·科恩、拉马克特·帕苏努鲁、叶金·崔、汉娜内·哈吉希尔齐以及阿斯莉·切利基尔马兹。\n- [使用语言模型进行可认证的推理](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.04031)\n  - 加布里埃尔·波西亚、卡尼什克·甘地、埃里克·泽利克曼、诺亚·D·古德曼\n- [大型语言模型目前仍无法自我纠正推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01798)\n  - 黄杰、陈欣云、斯瓦鲁普·米什拉、郑怀秀·史蒂文、余亚当斯·魏、宋欣莹、周登尼\n\n### 2022年\n- [思维链提示在大型语言模型中激发推理能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.11903)\n  - 杰森·韦伊、王雪芝、戴尔·舒尔曼斯、马尔滕·博斯马、布莱恩·伊希特、费伊·夏、埃德·奇、阮国、周登尼\n- [自洽性提升语言模型中的思维链推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.11171)\n  - 王雪芝、杰森·韦伊、戴尔·舒尔曼斯、阮国、埃德·奇、沙兰·纳朗、阿坎克莎·乔德里、周登尼\n- [用于辅助人类评估者的自我批判模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05802)\n  - 威廉·桑德斯、凯瑟琳·叶、杰夫·吴、史蒂文·比尔斯、龙·欧扬、乔纳森·沃德、扬·莱克\n- [基于程序克隆的思维链模仿](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.10816)\n  - 杨孟娇、戴尔·舒尔曼斯、皮特·阿贝尔、奥菲尔·纳楚姆\n- [STaR：以推理启动推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.14465)\n  - 埃里克·泽利克曼、吴宇怀、杰西·穆、诺亚·D·古德曼\n- [利用过程与结果反馈解决数学应用题](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.14275)\n  - 乔纳森·乌萨托、内特·库什曼、拉马纳·库马尔、弗朗西斯·宋、诺亚·西格尔、丽莎·王、安东尼娅·克雷斯威尔、杰弗里·欧文、伊琳娜·希金斯\n\n### 2021年\n- [训练验证器解决数学应用题](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.14168)\n  - 卡尔·科布、维尼特·科萨拉朱、穆罕默德·巴瓦里安、马克·陈、何伟·俊、卢卡什·凯泽、马蒂亚斯·普拉珀特、杰里·特沃雷克、雅各布·希尔顿、赖一郎·中野、克里斯托弗·赫塞、约翰·舒尔曼\n- [通过强化学习微调实现可扩展的在线规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2109.15316)\n  - 阿诺·菲金格、恒元·胡、布兰登·阿莫斯、斯图尔特·拉塞尔、诺姆·布朗\n- [用棋类游戏扩展规模法则](http:\u002F\u002Farxiv.org\u002Fabs\u002F2104.03113)\n  - 安迪·L·琼斯\n- [展示你的工作：语言模型的中间计算草稿纸](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00114)\n  - 马克斯韦尔·奈、安德斯·约翰·安德烈森、盖伊·古尔-阿里、亨里克·米哈列夫斯基、雅各布·奥斯汀、大卫·比伯、大卫·多汉、艾托尔·莱夫科维奇、马尔滕·博斯马、大卫·卢安、查尔斯·萨顿、奥古斯都·奥德纳\n\n### 2021年之前\n- [通过合作式部分可观测博弈中的搜索改进策略](https:\u002F\u002Farxiv.org\u002Fabs\u002F1912.02318)\n  - 亚当·莱勒、恒元·胡、雅各布·福斯特、诺姆·布朗\n- [用于自动定理证明的生成式语言建模](https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.03393)\n  - 斯塔尼斯拉夫·波卢、伊利亚·苏茨克维尔\n- 
[通过通用强化学习算法的自我对弈掌握国际象棋和将棋](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.01815v1)\n  - 大卫·西尔弗、托马斯·休伯特、朱利安·施里特维瑟、伊万尼斯·安东格鲁、马修·莱、阿瑟·格兹、马克·兰克托特、洛朗·西弗、达尔尚·库马拉恩、托雷·格雷佩尔、蒂莫西·利利克拉普、卡伦·西蒙尼扬、德米斯·哈萨比斯。","# Awesome-LLM-Strawberry 快速上手指南\n\n**Awesome-LLM-Strawberry** 并非一个可直接安装的单一软件包，而是一个汇聚了 OpenAI o1（Strawberry）及各类大模型推理（Reasoning）前沿研究论文、博客、开源模型和代码库的精选资源列表。本指南将帮助开发者快速利用该仓库中的资源，搭建本地推理环境并运行开源的类 o1 推理模型。\n\n## 环境准备\n\n要复现或体验类 o1 的推理能力，您需要准备支持大模型推理的硬件和软件环境。\n\n### 系统要求\n*   **操作系统**: Linux (推荐 Ubuntu 20.04\u002F22.04) 或 macOS。\n*   **GPU**: 推荐使用 NVIDIA GPU。\n    *   运行小参数模型（如 7B-14B）：显存建议 ≥ 16GB (如 RTX 3090\u002F4090)。\n    *   运行中等参数模型（如 32B-70B）：显存建议 ≥ 24GB-48GB (多卡或 A100\u002FA800\u002FH800)。\n    *   运行大型推理模型（如 DeepSeek-R1 671B）：需要多卡集群或使用量化版本。\n*   **内存**: 系统 RAM 建议 ≥ 32GB，处理长上下文推理时建议 64GB+。\n*   **存储**: 至少预留 50GB+ 空间用于存放模型权重和依赖库。\n\n### 前置依赖\n确保已安装以下基础工具：\n*   **Python**: 3.10 或更高版本\n*   **Git**: 用于克隆仓库\n*   **CUDA Toolkit**: 与您的 GPU 驱动版本匹配（通常建议 12.1+）\n*   **Package Manager**: `pip` 或 `conda`\n\n## 安装步骤\n\n由于本仓库是资源集合，\"安装\"主要指获取相关开源代码库并配置推理框架。以下以目前最热门的 **DeepSeek-R1** 或 **QwQ** 模型为例，使用通用的推理框架进行部署。\n\n### 1. 克隆资源仓库\n首先获取最新的资源列表和研究动态：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhijkzzz\u002FAwesome-LLM-Strawberry.git\ncd Awesome-LLM-Strawberry\n```\n\n### 2. 选择并安装推理框架\n根据仓库中 `Open-source -> Codebase` 部分的推荐，您可以选择以下任一主流框架。这里以 **vLLM**（高性能推理）或 **Hugging Face Transformers** 为例。\n\n#### 方案 A：使用 vLLM (推荐，速度快)\n```bash\n# 创建虚拟环境\nconda create -n strawberry-reason python=3.10 -y\nconda activate strawberry-reason\n\n# 安装 vLLM (建议使用国内镜像源加速)\npip install vllm -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n#### 方案 B：使用 LLaMA-Factory (适合微调与推理一体化)\n仓库中提到了 `EasyR1` 和 `LLaMA-Factory`，适合想要复现训练过程的用户：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory.git\ncd LLaMA-Factory\npip install -e \".[torch,metrics]\" -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 3. 
下载开源推理模型\n从 `Open-source -> Models` 列表中选择一个模型。以 **QwQ-32B** 或 **DeepSeek-R1-Distill** 为例，使用 `huggingface-cli` 下载（国内用户可使用镜像）：\n\n```bash\n# 安装 huggingface hub 工具\npip install huggingface_hub -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# 下载 QwQ-32B 模型 (示例)\n# 注意：大模型文件较大，请确保磁盘空间充足\nhuggingface-cli download --resume-download Qwen\u002FQwQ-32B --local-dir .\u002Fmodels\u002FQwQ-32B --local-dir-use-symlinks False\n```\n*注：若无法访问 HuggingFace，可尝试使用 ModelScope (魔搭社区) 下载对应模型。*\n\n## 基本使用\n\n安装完成后，您可以加载下载的模型进行推理测试。类 o1 模型的特点是会在输出最终答案前生成一段“思维链”（Chain of Thought）。\n\n### 使用 vLLM 启动服务\n以下命令将启动一个兼容 OpenAI API 格式的本地服务：\n\n```bash\npython -m vllm.entrypoints.openai.api_server \\\n    --model .\u002Fmodels\u002FQwQ-32B \\\n    --tensor-parallel-size 1 \\\n    --trust-remote-code \\\n    --port 8000\n```\n*如果是多卡环境，请调整 `--tensor-parallel-size` 为显卡数量。*\n\n### 发送推理请求\n使用 `curl` 或 Python 脚本向本地服务发送问题。观察输出，您会看到模型先进行长时间的思考（Thinking Process），然后给出结论。\n\n```bash\ncurl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fchat\u002Fcompletions \\\n    -H \"Content-Type: application\u002Fjson\" \\\n    -d '{\n        \"model\": \".\u002Fmodels\u002FQwQ-32B\",\n        \"messages\": [\n            {\"role\": \"user\", \"content\": \"If I have 3 apples and buy 5 more, then give away half, how many do I have left? Think step by step.\"}\n        ],\n        \"max_tokens\": 2048\n    }'\n```\n\n### 预期输出示例\n模型返回的内容将包含类似以下的结构（具体取决于模型实现）：\n1.  **思考过程**: 模型内部推导步骤（例如：\"First, calculate total apples... then divide by two...\"）。\n2.  
**最终答案**: 清晰的结论。\n\n> **提示**: 对于更复杂的复现项目（如 `OpenRLHF` 或 `SkyRL`），请参考仓库中 `Open-source -> Codebase` 部分对应的 GitHub 链接，查阅其具体的 `README.md` 以获取训练和高级推理指令。","某 AI 初创公司的算法团队正致力于复现 OpenAI o1 的推理能力，以构建垂直领域的复杂问题解决模型。\n\n### 没有 Awesome-LLM-Strawberry 时\n- **信息搜集效率低下**：研究人员需手动在 arXiv、Twitter 和技术博客间穿梭，耗费数天才能拼凑出关于“o1 架构”或“强化学习验证（RLVR）”的零散资讯。\n- **错过关键前沿动态**：由于缺乏统一追踪源，团队容易遗漏如 DeepSeek-R1、Kimi k1 等竞品发布的最新推理技术细节，导致技术路线滞后。\n- **理论复现门槛高**：面对晦涩的论文和缺失的代码实现参考，工程师难以快速理解从“监督结果”转向“监督过程”的核心训练范式，试错成本极高。\n- **资源分散难整合**：官方文档、深度分析文章与开源项目分散各处，缺乏系统性整理，阻碍了团队对推理缩放定律（Inference Scaling Laws）的整体认知。\n\n### 使用 Awesome-LLM-Strawberry 后\n- **一站式获取前沿情报**：团队直接通过该仓库即可获取涵盖 OpenAI o3、Gemini 2.0 Flash Thinking 及国内大模型的最新推理进展，将调研时间从数天压缩至数小时。\n- **精准锁定核心技术路径**：借助仓库中精选的“REINFORCE++ baseline”、“Online IcePop”等技术博客，工程师迅速掌握了稳定 MoE 路由与高效推理训练的关键方法。\n- **加速模型复现进程**：依托整理的 o1 逆向工程分析与 ARC-AGI 评测突破案例，团队快速构建了基于“过程监督”的实验框架，显著减少了盲目尝试。\n- **构建系统化知识体系**：从官方指南到社区深度解读，所有资源按逻辑分类，帮助团队成员快速对齐对推理机制的理解，提升了协作效率。\n\nAwesome-LLM-Strawberry 将碎片化的推理技术情报转化为系统化的研发燃料，让团队在激烈的模型竞赛中抢占先机。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhijkzzz_Awesome-LLM-Strawberry_a1e70aa8.png","hijkzzz",null,"https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhijkzzz_76042d30.png","RLer + MLSyser \u002F 2  + NLPer \u002F 2","janhu9527@gmail.com","hujian.website","https:\u002F\u002Fgithub.com\u002Fhijkzzz",6904,369,"2026-04-02T13:35:18","Apache-2.0",1,"","未说明",{"notes":90,"python":88,"dependencies":91},"该项目（Awesome-LLM-Strawberry）是一个收集关于 OpenAI o1 及大模型推理相关研究论文、博客、开源模型和代码库的资源列表，本身不是一个可直接运行的软件工具或框架，因此 README 中未包含具体的操作系统、GPU、内存、Python 版本或依赖库等运行环境需求。用户若需运行列表中提到的具体开源模型（如 DeepSeek-R1, QwQ 等）或代码库（如 OpenRLHF, SkyRL 等），需参考各自项目仓库的说明文档。",[],[26,13],[94,95,96,97,98,99,100,101],"chain-of-thought","coding","llm","mathematics","mcts","openai-o1","strawberry","reinforcement-learning","2026-03-27T02:49:30.150509","2026-04-06T08:46:18.520601",[],[]]