[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-mbzuai-oryx--Awesome-LLM-Post-training":3,"similar-mbzuai-oryx--Awesome-LLM-Post-training":51},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":8,"readme_en":9,"readme_zh":10,"quickstart_zh":11,"use_case_zh":12,"hero_image_url":13,"owner_login":14,"owner_name":15,"owner_avatar_url":16,"owner_bio":17,"owner_company":18,"owner_location":18,"owner_email":18,"owner_twitter":18,"owner_website":19,"owner_url":20,"languages":21,"stars":26,"forks":27,"last_commit_at":28,"license":18,"difficulty_score":29,"env_os":30,"env_gpu":31,"env_ram":31,"env_deps":32,"category_tags":35,"github_topics":38,"view_count":45,"oss_zip_url":18,"oss_zip_packed_at":18,"status":46,"created_at":47,"updated_at":48,"faqs":49,"releases":50},10130,"mbzuai-oryx\u002FAwesome-LLM-Post-training","Awesome-LLM-Post-training","Awesome Reasoning LLM Tutorial\u002FSurvey\u002FGuide","Awesome-LLM-Post-training 是一个专注于大语言模型（LLM）后训练方法的精选资源库，旨在帮助开发者和研究人员深入理解并提升模型的推理能力。随着大模型基础能力的成熟，如何通过微调、强化学习及测试时扩展等技术进一步挖掘其逻辑推理与决策潜力，成为当前技术落地的关键难点。该资源库系统性地梳理了相关领域最具影响力的论文、代码实现、基准测试及教程，涵盖了从理论综述、奖励学习、策略优化到多智能体协作等全方位内容。\n\n其核心亮点在于提供了一套清晰的技术分类体系，将复杂的后训练方法归纳为微调、强化学习和测试时扩展三大类，并关联了最新的学术成果与开源框架。无论是希望复现前沿算法的工程师，还是致力于探索模型推理机制的研究人员，都能在此快速找到所需的理论依据与实践工具。通过整合分散的行业资源，Awesome-LLM-Post-training 有效降低了进入高阶大模型研发领域的门槛，是构建具备深度推理能力 AI 系统的得力助手。","# LLM Post-Training: A Deep Dive into Reasoning Large Language Models\n\n\n[![MIT License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-green.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2502.21321-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.21321)  [![Maintenance](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMaintained%3F-yes-green.svg)](https:\u002F\u002Fgithub.com\u002Fzzli2022\u002FSystem2-Reasoning-LLM)\n[![Contribution Welcome](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FContributions-welcome-blue)]()\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FwaxVImv.png\" alt=\"Oryx Video-ChatGPT\">\n\u003C\u002Fp>\n\nWelcome to the **Awesome-LLM-Post-training** repository! This repository is a curated collection of the most influential papers, code implementations, benchmarks, and resources related to **Large Language Models (LLMs) Post-Training  Methodologies**. \n\nOur work is based on the following paper:  \n📄 **LLM Post-Training: A Deep Dive into Reasoning Large Language Models** – Available on [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2502.21321-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.21321)\n\n#### [Komal Kumar](mailto:komal.kumar@mbzuai.ac.ae)* , [Tajamul Ashraf](https:\u002F\u002Fwww.tajamulashraf.com)* , [Omkar Thawakar](https:\u002F\u002Fomkarthawakar.github.io\u002Findex.html) , [Rao Muhammad Anwer](https:\u002F\u002Fmbzuai.ac.ae\u002Fstudy\u002Ffaculty\u002Frao-muhammad-anwer\u002F) , [Hisham Cholakkal](https:\u002F\u002Fmbzuai.ac.ae\u002Fstudy\u002Ffaculty\u002Fhisham-cholakkal\u002F) , [Mubarak Shah](https:\u002F\u002Fwww.crcv.ucf.edu\u002Fperson\u002Fmubarak-shah\u002F) , [Ming-Hsuan Yang](https:\u002F\u002Fresearch.google\u002Fpeople\u002F105989\u002F) , [Philip H.S. 
Torr](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FPhilip_Torr) , [Fahad Shahbaz Khan](https:\u002F\u002Fsites.google.com\u002Fview\u002Ffahadkhans\u002Fhome) , and [Salman Khan](https:\u002F\u002Fsalman-h-khan.github.io\u002F)  \n\\* Equally contributing first authors\n\n \n- **Corresponding authors:** [Komal Kumar](mailto:komal.kumar@mbzuai.ac.ae), [Tajamul Ashraf](https:\u002F\u002Fwww.tajamulashraf.com\u002F).  \n\nFeel free to ⭐ star and fork this repository to keep up with the latest advancements and contribute to the community.\n\n---\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_b6f634c7e538.jpg\" width=\"45%\" height=\"50%\" \u002F>\n\u003C!--   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_ad5d583b5c73.jpg\" width=\"45%\" height=\"50%\" \u002F> -->\n\u003C\u002Fp>\nA taxonomy of post-training approaches for **LLMs**, categorized into Fine-tuning, Reinforcement Learning, and Test-time Scaling methods. We summarize the key techniques used in recent LLMs.\n\n---\n\n## 📌 Contents  \n\n| Section | Subsection |  \n| ------- | ----------- |  \n| [📖 Papers](#papers) | [Survey](#survey), [Theory](#theory), [Explainability](#explainability) |  \n| [🤖 LLMs in RL](#LLMs-in-RL) | LLM-Augmented Reinforcement Learning |  \n| [🏆 Reward Learning](#reward-learning) | [Human Feedback](#human-feedback), [Preference-Based RL](#preference-based-rl), [Intrinsic Motivation](#intrinsic-motivation) |  \n| [🚀 Policy Optimization](#policy-optimization) | [Offline RL](#offline-rl), [Imitation Learning](#imitation-learning), [Hierarchical RL](#hierarchical-rl) |  \n| [🧠 LLMs for Reasoning & Decision-Making](#llms-for-reasoning-and-decision-making) | [Causal Reasoning](#causal-reasoning), [Planning](#planning), [Commonsense RL](#commonsense-rl) |  \n| [🌀 Exploration & Generalization](#exploration-and-generalization) | [Zero-Shot RL](#zero-shot-rl), [Generalization in RL](#generalization-in-rl), [Self-Supervised RL](#self-supervised-rl) |  \n| [🤝 Multi-Agent RL (MARL)](#multi-agent-rl-marl) | [Emergent Communication](#emergent-communication), [Coordination](#coordination), [Social RL](#social-rl) |  \n| [⚡ Applications & Benchmarks](#applications-and-benchmarks) | [Autonomous Agents](#autonomous-agents), [Simulations](#simulations), [LLM-RL Benchmarks](#llm-rl-benchmarks) |  \n| [📚 Tutorials & Courses](#tutorials-and-courses) | [Lectures](#lectures), [Workshops](#workshops) |  \n| [🛠️ Libraries & Implementations](#libraries-and-implementations) | Open-Source RL-LLM Frameworks |  \n| [🔗 Other Resources](#other-resources) | Additional Research & Readings |  \n\n---\n\n# 📖 Papers  \n\n## 🔍 Survey  \n\n| Title | Publication Date | Link |\n|---------------------------------|----------------|---------------------------------|\n| A Survey on Bridging VLMs and Synthetic Data | 16 May 2025 | [OpenReview](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ThjDCZOljE) |\n| A Survey on Post-training of Large Language Models | 8 Mar 2025 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.06072) |\n| LLM Post-Training: A Deep Dive into Reasoning Large Language Models | 28 Feb 2025 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.21321) |\n| From System 1 to System 2: A Survey of Reasoning Large Language Models | 25 Feb 2025 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17419) |\n| Empowering LLMs with Logical Reasoning: A
Comprehensive Survey | 24 Feb 2025 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.15652) |\n| A Survey on Multimodal Large Language Models | 10 Feb 2025 | [Oxford Academic](https:\u002F\u002Facademic.oup.com\u002Fnsr\u002Farticle\u002F11\u002F12\u002Fnwae403\u002F7896414) |\n| Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models | 16 Jan 2025 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.09686) |\n| Large Language Models: A Survey of Their Development, Capabilities, and Applications | 15 Jan 2025 | [Springer](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs10115-024-02310-4) |\n| Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey | 29 Dec 2024 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.20367) |\n| Reinforcement Learning Enhanced LLMs: A Survey | 5 Dec 2024 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10400) |\n| Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | 26 Sep 2024 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.18169) |\n| Large Language Models (LLMs): Survey, Technical Frameworks, and Future Directions | 20 Jul 2024 | [Springer](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs10462-024-10888-y) |\n| Reasoning with Large Language Models, a Survey | 16 Jul 2024 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11511) |\n| Reinforcement Learning Problem Solving with Large Language Models | 29 Apr 2024 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.18638) |\n| Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods | 30 Mar 2024 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.00282) |\n| ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models | 14 Mar 2024 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09583) |\n| Using Large Language Models to Automate and Expedite Reinforcement Learning with Reward Machines | 11 Feb 2024 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.07069) |\n| A Survey on Large Language Models for Reinforcement Learning | 10 Dec 2023 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04567) |\n| Large Language Models as Decision-Makers: A Survey | 23 Aug 2023 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.11749) |\n| A Survey on Large Language Model Alignment Techniques | 6 May 2023 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.00921) |\n| Reinforcement Learning with Human Feedback: A Survey | 12 Apr 2023 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.04989) |\n| Reasoning with Large Language Models: A Survey | 14 Feb 2023 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.06476) |\n| A Survey on Foundation Models for Decision Making | 9 Jan 2023 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.04150) |\n| Large Language Models in Reinforcement Learning: Opportunities and Challenges | 5 Dec 2022 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09142) |\n| Training language models to follow instructions with human feedback | 4 Mar 2022 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.02155) |\n\n---\n\n## 🤖 LLMs-in-RL  \n\n* Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play [[Paper]](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2509.25541) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.09-red)\n* Satori: Reinforcement Learning with
Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.02508) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [[Paper]](https:\u002F\u002Fpretty-radio-b75.notion.site\u002FDeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNotion-2025.02-red)\n* QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.02584) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Process Reinforcement through Implicit Rewards [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.11651) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17030) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12948) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* Kimi k1.5: Scaling Reinforcement Learning with LLMs [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12599) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* Does RLHF Scale? 
Exploring the Impacts From Data, Model, and Method [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06000) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* Offline Reinforcement Learning for LLM Multi-Step Reasoning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16145) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* ReFT: Representation Finetuning for Language Models [[Paper]](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.410.pdf) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FACL-2024-blue)\n* Deepseekmath: Pushing the limits of mathematical reasoning in open language models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.03300) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.02-red)\n* Reasoning with Reinforced Functional Token Tuning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13389) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Value-Based Deep RL Scales Predictably [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04327) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* InfAlign: Inference-aware language model alignment [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.19792) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* LIMR: Less is More for RL Scaling [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11886) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.143) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n \n\n---\n\n## 🏆 Reward Learning (Process Reward Models)\n\n* PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.03124) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07861) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* The Lessons of Developing Process Reward Models in Mathematical Reasoning. [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07301) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark. [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01290) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* AutoPSV: Automated Process-Supervised Verifier [[Paper]](https:\u002F\u002Fopenreview.net\u002Fforum?id=eOAPWWOGs9) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS-2024-blue)\n* ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https:\u002F\u002Fopenreview.net\u002Fforum?id=8rcFOqEud5) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS-2024-blue)\n* Free Process Rewards without Process Labels. 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.01981) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* Outcome-Refining Process Supervision for Code Generation [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15118) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [[Paper]](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.510\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FACL-2024-blue)\n* OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning [[Paper]](https:\u002F\u002Faclanthology.org\u002F2024.findings-naacl.55\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FACL_Findings-2024-blue)\n* Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.18629) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.06-red)\n* Let's Verify Step by Step. [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.20050) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.05-red)\n* Improve Mathematical Reasoning in Language Models by Automated Process Supervision [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05372) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2023.06-red)\n* Making Large Language Models Better Reasoners with Step-Aware Verifier [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.02336) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2023.06-red)\n* Solving Math Word Problems with Process and Outcome-Based Feedback [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.14275) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2022.11-red)\n* Uncertainty-Aware Step-wise Verification with Generative Reward Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11250) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [[Paper]](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2502.13943) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models [[Paper]](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2502.08922) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Can 1B LLM Surpass 405B LLM? 
Rethinking Compute-Optimal Test-Time Scaling [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06703) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19328) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20325) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.05-red)\n* LLMs Meet Finance: Fine-Tuning Foundation Models for the Open FinLLM Leaderboard [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.13125) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.04-red)
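\n\nMany of the process-reward-model (PRM) papers above share a single inference-time recipe: sample several candidate reasoning chains, score every step with a PRM, and keep the chain whose weakest step scores highest. Below is a minimal, self-contained sketch of that best-of-N loop; `toy_generate` and `toy_score_step` are hypothetical stand-ins for a real sampler and a trained PRM, and the min-over-steps aggregation follows common PRM practice rather than any one paper above.\n\n```python\n# PRM-guided best-of-N selection (sketch; the stubs below are placeholders).\nimport random\nfrom typing import Callable, List\n\ndef prm_score(prompt: str, steps: List[str],\n              score_step: Callable[[str, List[str]], float]) -> float:\n    # A chain is only as strong as its weakest step.\n    return min(score_step(prompt, steps[: i + 1]) for i in range(len(steps)))\n\ndef best_of_n(prompt: str, n: int,\n              generate: Callable[[str], List[str]],\n              score_step: Callable[[str, List[str]], float]) -> List[str]:\n    # Sample n reasoning chains and keep the one the PRM likes best.\n    candidates = [generate(prompt) for _ in range(n)]\n    return max(candidates, key=lambda s: prm_score(prompt, s, score_step))\n\n# Toy stubs so the sketch runs end to end.\ndef toy_generate(prompt: str) -> List[str]:\n    return [f\"step {i}: ...\" for i in range(random.randint(2, 5))]\n\ndef toy_score_step(prompt: str, prefix: List[str]) -> float:\n    return random.random()  # a real PRM would return P(step is correct)\n\nprint(best_of_n(\"Solve 12 * 7.\", n=8, generate=toy_generate, score_step=toy_score_step))\n```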
\n\n---\n\n## Policy Optimization\n\n* Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06892) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.07-red)\n\n---\n\n## MCTS\u002FTree Search\n* On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes [[Paper]](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F10870057\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FIEEE_TAC-2025-blue)\n* Search-o1: Agentic Search-Enhanced Large Reasoning Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.05366) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04519) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.03816) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.09078) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18925) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18319) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* Proposing and solving olympiad geometry with guided tree search [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10673) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.11605) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17397) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04329) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04459) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15645) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14405) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11053) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* Don’t throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding [[Paper]](https:\u002F\u002Fopenreview.net\u002Fforum?id=kh9Zt2Ldmn#discussion) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCoLM-2024-blue)\n* AFlow: Automating Agentic Workflow Generation [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10762) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* Interpretable Contrastive Monte Carlo Tree Search Reasoning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01707) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.02884) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06508) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16033) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17820) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.09584) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.09-red)\n* Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.10635) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.08-red)\n* LiteSearch: Efficacious Tree Search for LLM [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.00320) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.07-red)\n* Tree Search for Language Model Agents [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.01476) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.07-red)\n* Uncertainty-Guided Optimization on Large Language Model Search Trees
[[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.03951) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.07-red)\n* Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07394) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.06-red)\n* Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [[Paper]](https:\u002F\u002Fopenreview.net\u002Fforum?id=rviGTsl0oy) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICLR_WorkShop-2024-blue)\n* LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models [[Paper]](https:\u002F\u002Fopenreview.net\u002Fforum?id=h1mvwbQiXR) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICLR_WorkShop-2024-blue)\n* AlphaMath Almost Zero: process Supervision without process [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.03553) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.05-red)\n* Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.15383) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.05-red)\n* MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16265) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.05-red)\n* Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.00451) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.05-red)\n* Stream of Search (SoS): Learning to Search in Language [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03683) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.04-red)\n* Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.12253) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.04-red)\n* Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models [[Paper]](https:\u002F\u002Fopenreview.net\u002Fforum?id=CVpuVe1N22&noteId=aTI8PGpO47) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS-2024-blue)\n* Reasoning with Language Model is Planning with World Model [[Paper]](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.507\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FEMNLP-2023-blue)\n* Large Language Models as Commonsense Knowledge for Large-Scale Task Planning [[Paper]](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS-2023-blue)\n* Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training [[Paper]](https:\u002F\u002Fopenreview.net\u002Fforum?id=PJfc4x2jXY) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS_WorkShop-2023-blue)\n* Making PPO Even Better: Value-Guided Monte-Carlo Tree Search Decoding [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15028) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2023.09-red)
\n* Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11169) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11881) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Fine-grained Conversational Decoding via Isotropic and Proximal Search [[Paper]](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.5\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FEMNLP-2023-blue)\n* Control-DAG: Constrained Decoding for Non-Autoregressive Directed Acyclic T5 using Weighted Finite State Automata [[Paper]](https:\u002F\u002Faclanthology.org\u002F2024.naacl-short.42\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNAACL-2024-blue)\n* Look-back Decoding for Open-Ended Text Generation [[Paper]](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.66\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FEMNLP-2023-blue)\n* LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17925) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)
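\n\nMost of the tree-search papers above instantiate variants of the same four-phase loop: select a promising node, expand it with LLM-proposed steps, evaluate the result, and backpropagate the score. The snippet below is a generic UCT-style skeleton of that loop, not a reproduction of any specific method listed here; `propose` and `score` are hypothetical stand-ins for an LLM step proposer and a value model or verifier.\n\n```python\n# Generic UCT-style tree search over reasoning steps (illustrative sketch).\nimport math\nimport random\nfrom typing import List, Optional\n\nclass Node:\n    def __init__(self, state: str, parent: Optional[\"Node\"] = None):\n        self.state, self.parent = state, parent\n        self.children: List[\"Node\"] = []\n        self.visits, self.value = 0, 0.0\n\n    def ucb(self, c: float = 1.4) -> float:  # upper confidence bound\n        if self.visits == 0:\n            return float(\"inf\")\n        return self.value \u002F self.visits + c * math.sqrt(math.log(self.parent.visits) \u002F self.visits)\n\ndef mcts(root_state: str, propose, score, iters: int = 50) -> str:\n    root = Node(root_state)\n    for _ in range(iters):\n        node = root\n        while node.children:  # 1. selection\n            node = max(node.children, key=Node.ucb)\n        for step in propose(node.state):  # 2. expansion\n            node.children.append(Node(node.state + \" | \" + step, parent=node))\n        if node.children:\n            node = random.choice(node.children)\n        reward = score(node.state)  # 3. evaluation (value model \u002F verifier)\n        while node:  # 4. backpropagation\n            node.visits += 1\n            node.value += reward\n            node = node.parent\n    return max(root.children, key=lambda n: n.visits).state\n\n# Toy stubs so the skeleton runs.\npropose = lambda s: [f\"step {random.randint(0, 9)}\" for _ in range(3)]\nscore = lambda s: random.random()\nprint(mcts(\"Question: ...\", propose, score, iters=20))\n```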
\n\n---\n\n## Explainability\n* Agents Thinking Fast and Slow: A Talker-Reasoner Architecture [[Paper]](https:\u002F\u002Fopenreview.net\u002Fforum?id=xPhcP6rbI4) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS_WorkShop-2024-blue)\n* What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23743) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1 [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01792) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* The Impact of Reasoning Step Length on Large Language Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.04925) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.08-red)\n* Distilling System 2 into System 1 [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.06023) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.07-red)\n* System 2 Attention (is something you might need too) [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.11829) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2023.11-red)\n* Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04682) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.06186) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19230) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Exploring Iterative Enhancement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models [[Paper]](http:\u002F\u002Farxiv.org\u002Fabs\u002F2309.10444) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAAAI\u002FEAAI-2025-blue)\n* AbductionRules: Training Transformers to Explain Unexpected Inputs [[Paper]](https:\u002F\u002Faclanthology.org\u002F2022.findings-acl.19\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FACL_Findings-2022-blue)\n## Multimodal Agents with Slow-Fast Reasoning\n* Diving into Self-Evolving Training for Multimodal Reasoning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17451) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* Visual Agents as Fast and Slow Thinkers [[Paper]](https:\u002F\u002Fopenreview.net\u002Fforum?id=ncCuiD3KJQ) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICLR-2025-blue)\n* Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01904) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03704) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* Slow Perception: Let's Perceive Geometric Figures Step-by-Step [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.20631) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11930) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10440) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00855) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* I Think, Therefore I
Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10458) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13957) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n## Benchmark and Datasets\n* PhyX: Does Your Model Have the \"Wits\" for Physical Reasoning? [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15929) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.05-red)\n* Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17387) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.03124) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [[Paper]](https:\u002F\u002Fopenreview.net\u002Fforum?id=GN2qbxZlni) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS-2024-blue)\n* Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.21187) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.15277) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.09-red)\n* EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12466) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14739) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14191) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04872) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* Evaluation of OpenAI o1: Opportunities and Challenges of AGI [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.18486) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.09-red)\n* MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06453) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.15089) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* Humanity's Last Exam [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14249) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* LR\u003Csup>2\u003C\u002Fsup>Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language 
Models via Constraint Satisfaction Problems [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17848) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* BIG-Bench Extra Hard [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19187) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.09430) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICONIP-2024-blue)\n* Multi-Step Deductive Reasoning Over Natural Language: An Empirical Study on Out-of-Distribution Generalisation [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.14000) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeSy-2022-blue)\n* Large Language Models Are Not Strong Abstract Reasoners [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.19555) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FIJCAI-2024-blue)\n\n## Reasoning and Safety\n* Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00555v1) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.03-red)\n* OverThink: Slowdown Attacks on Reasoning LLMs [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.02542) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* GuardReasoner: Towards Reasoning-based LLM Safeguards [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18492) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICLR_WorkShop-2025-blue)\n* SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12025) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13458) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1\u002Fo3, DeepSeek-R1, and Gemini 2.0 Flash Thinking [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12893) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12202) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Abstract Meaning Representation-Based Logic-Driven Data Augmentation for Logical Reasoning [[Paper]](https:\u002F\u002Faclanthology.org\u002F2024.findings-acl.353\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FACL_Findings-2024-blue)\n* ChatLogic: Integrating Logic Programming with Large Language Models for Multi-step Reasoning [[Paper]](https:\u002F\u002Fopenreview.net\u002Fforum?id=AOqGF7Po7Z) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAAAI_WorkShop-2024-blue)\n\n---\n## 🚀 RL & LLM Fine-Tuning Repositories\n\n| #  | Repository & Link                                                                                          | Description
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |\n|----|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| 1  | [**RL4VLM**](https:\u002F\u002Fgithub.com\u002FRL4VLM\u002FRL4VLM) \u003Cbr>\u003Cbr> _Archived & Read-Only as of December 15, 2024_       | Offers code for fine-tuning large vision-language models as decision-making agents via RL. Includes implementations for training models with task-specific rewards and evaluating them in various environments.                                                                                                                                                                                                                                                                                                         |\n| 2  | [**LlamaGym**](https:\u002F\u002Fgithub.com\u002FKhoomeiK\u002FLlamaGym)                                                      | Simplifies fine-tuning large language model (LLM) agents with online RL. Provides an abstract `Agent` class to handle various aspects of RL training, allowing for quick iteration and experimentation across different environments.                                                                                                                                                                                                                                                                                      |\n| 3  | [**RL-Based Fine-Tuning of Diffusion Models for Biological Sequences**](https:\u002F\u002Fgithub.com\u002Fmasa-ue\u002FRLfinetuning_Diffusion_Bioseq) | Accompanies a tutorial and review paper on RL-based fine-tuning, focusing on the design of biological sequences (DNA\u002FRNA). Provides comprehensive tutorials and code implementations for training and fine-tuning diffusion models using RL.                                                                                                                                                                                                                                     |\n| 4  | [**LM-RL-Finetune**](https:\u002F\u002Fgithub.com\u002Fzhixuan-lin\u002FLM-RL-finetune)                                       | Aims to improve KL penalty optimization in RL fine-tuning of language models by computing the KL penalty term analytically. Includes configurations for training with Proximal Policy Optimization (PPO).                                                                                                                                                                            
                                                                                                                                         |\n| 5  | [**InstructLLaMA**](https:\u002F\u002Fgithub.com\u002Fmichaelnny\u002FInstructLLaMA)                                           | Implements pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF) to train and fine-tune the LLaMA2 model to follow human instructions, similar to InstructGPT or ChatGPT.                                                                                                                                                                                                                                                                                                       |\n| 6  | [**SEIKO**](https:\u002F\u002Fgithub.com\u002Fzhaoyl18\u002FSEIKO)                                                             | Introduces a novel RL method to efficiently fine-tune diffusion models in an online setting. Its techniques outperform baselines such as PPO, classifier-based guidance, and direct reward backpropagation for fine-tuning Stable Diffusion.                                                                                                                                                                                                                                                                              |\n| 7  | [**TRL (Train Transformer Language Models with RL)**](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl)                  | A state-of-the-art library for post-training foundation models using methods like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), GRPO, and Direct Preference Optimization (DPO). Built on the 🤗 Transformers ecosystem, it supports multiple model architectures and scales efficiently across hardware setups.                                                                                                                                                                                     |\n| 8  | [**Fine-Tuning Reinforcement Learning Models as Continual Learning**](https:\u002F\u002Fgithub.com\u002FBartekCupial\u002Ffinetuning-RL-as-CL) | Explores fine-tuning RL models as a forgetting mitigation problem (continual learning). Provides insights and code implementations to address forgetting in RL models.                                                                                                                                                                                                                                                                                                                                                        |\n| 9  | [**RL4LMs**](https:\u002F\u002Fgithub.com\u002Fallenai\u002FRL4LMs)                                                            | A modular RL library to fine-tune language models to human preferences. Rigorously evaluated through 2000+ experiments using the GRUE benchmark, ensuring robustness across various NLP tasks.                                                                                                                                                                                                                                                                                                                             
|\n| 10 | [**Lamorel**](https:\u002F\u002Fgithub.com\u002Fflowersteam\u002Flamorel)                                                      | A high-throughput, distributed architecture for seamless LLM integration in interactive environments. While not specialized in RL or RLHF by default, it supports custom implementations and is ideal for users needing maximum flexibility.                                                                                                                     |\n| 11 | [**LLM-Reverse-Curriculum-RL**](https:\u002F\u002Fgithub.com\u002FWooooDyy\u002FLLM-Reverse-Curriculum-RL)                     | Implements the ICML 2024 paper *\"Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning\"*. Focuses on enhancing LLM reasoning capabilities using a reverse curriculum RL approach.                                                                                                                                                                                                                                                  |\n| 12 | [**veRL**](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl)                                                             | A flexible, efficient, and production-ready RL training library for large language models (LLMs). Serves as the open-source implementation of the HybridFlow framework and supports various RL algorithms (PPO, GRPO), advanced resource utilization, and scalability up to 70B models on hundreds of GPUs. Integrates with Hugging Face models, supervised fine-tuning, and RLHF with multiple reward types.                                                  |\n| 13 | [**trlX**](https:\u002F\u002Fgithub.com\u002FCarperAI\u002Ftrlx)                                                               | A distributed training framework for fine-tuning large language models (LLMs) with reinforcement learning. Supports both Accelerate and NVIDIA NeMo backends, allowing training of models up to 20B+ parameters. Implements PPO and ILQL, and integrates with CHEESE for human-in-the-loop data collection.                                                                                                                                                                                              |\n| 14 | [**Okapi**](https:\u002F\u002Fgithub.com\u002Fnlp-uoregon\u002FOkapi)                                                          | A framework for instruction tuning in LLMs with RLHF, supporting 26 languages. Provides multilingual resources such as ChatGPT prompts, instruction datasets, and response ranking data, along with both BLOOM-based and LLaMa-based models and evaluation benchmarks.                                                                                                                                                                                                                                                 |\n| 15 | [**LLaMA-Factory**](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory)                                              | *Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)*. Supports a wide array of models (e.g., LLaMA, LLaVA, Qwen, Mistral) with methods including pre-training, multimodal fine-tuning, reward modeling, PPO, DPO, and ORPO. Offers scalable tuning (16-bit, LoRA, QLoRA) with advanced optimizations and logging integrations, and provides fast inference via API, Gradio UI, and CLI with vLLM workers.                                                 
|\n\n---\n## ⚡ Applications & Benchmarks  \n\n- **\"AutoGPT: LLMs for Autonomous RL Agents\"** - OpenAI (2023) [[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.03442)]  \n- **\"Barkour: Benchmarking LLM-Augmented RL\"** - Wu et al. (2023) [[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.12377)]  \n* Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning (Medical Understanding & Reasoning) [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.03667) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.12-red) [[Code]](https:\u002F\u002Fgithub.com\u002Fai4colonoscopy\u002FColon-X)\n* Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17387) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.03124) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [[Paper]](https:\u002F\u002Fopenreview.net\u002Fforum?id=GN2qbxZlni) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS-2024-blue)\n* Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.21187) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.15277) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.09-red)\n* EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12466) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14739) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14191) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04872) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* Evaluation of OpenAI o1: Opportunities and Challenges of AGI [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.18486) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.09-red)\n* MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06453) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.15089) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* Humanity's Last Exam [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14249) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* LR\u003Csup>2\u003C\u002Fsup>Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language 
\n\n---\n## ⚡ Applications & Benchmarks  \n\n* AutoGPT: LLMs for Autonomous RL Agents (OpenAI, 2023) [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.03442)\n* Barkour: Benchmarking LLM-Augmented RL (Wu et al., 2023) [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.12377)\n* Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning (Medical Understanding & Reasoning) [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.03667) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.12-red) [[Code]](https:\u002F\u002Fgithub.com\u002Fai4colonoscopy\u002FColon-X)\n\nBenchmark-focused papers (Big-Math, PRMBench, MR-Ben, SuperGPQA, FrontierMath, Humanity's Last Exam, and more) are collected under [Benchmark and Datasets](#benchmark-and-datasets) above.\n\n---\n\n## 📚 Tutorials & Courses  \n\n- 🎥 **Deep RL Bootcamp (Berkeley)** [[Website](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdeep-rl-bootcamp\u002F)]  \n- 🎥 **DeepMind RL Series** [[Website](https:\u002F\u002Fdeepmind.com\u002Flearning-resources)]  \n\n---\n\n## 🛠️ Libraries & Implementations  \n\n- 🔹 [Decision Transformer (GitHub)](https:\u002F\u002Fgithub.com\u002Fkzl\u002Fdecision-transformer)  \n- 🔹 [ReAct (GitHub)](https:\u002F\u002Fgithub.com\u002Fysymyth\u002FReAct)  \n- 🔹 [RLHF (GitHub)](https:\u002F\u002Fgithub.com\u002Fopenai\u002Flm-human-preferences)  \n\n---\n\n## 🔗 Other Resources  \n\n- [LLM for RL Workshop at NeurIPS 2023](https:\u002F\u002Fneurips.cc)  \n- [OpenAI Research Blog on RLHF](https:\u002F\u002Fopenai.com\u002Fresearch)  \n\n---\n\n## 📌 Contributing  \n\nContributions are welcome! If you have relevant papers, code, or insights, feel free to submit a pull request.  \n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_2d4739b2a7ab.png)](https:\u002F\u002Fwww.star-history.com\u002F#mbzuai-oryx\u002FAwesome-LLM-Post-training&Timeline)\n\n## Citation\n\nIf you find our work useful or use it in your research, please consider citing:\n\n```bibtex\n@misc{kumar2025llmposttrainingdeepdive,\n      title={LLM Post-Training: A Deep Dive into Reasoning Large Language Models}, \n      author={Komal Kumar and Tajamul Ashraf and Omkar Thawakar and Rao Muhammad Anwer and Hisham Cholakkal and Mubarak Shah and Ming-Hsuan Yang and Philip H. S. Torr and Fahad Shahbaz Khan and Salman Khan},\n      year={2025},\n      eprint={2502.21321},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.21321}, \n}\n```\n\n## License :scroll:\n\u003Ca rel=\"license\" href=\"http:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc-sa\u002F4.0\u002F\">\u003Cimg alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_8a4e76cf0ed2.png\" \u002F>\u003C\u002Fa>\u003Cbr \u002F>This work is licensed under a \u003Ca rel=\"license\" href=\"http:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc-sa\u002F4.0\u002F\">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License\u003C\u002Fa>.\n\n## Star History Chart\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_cc73a0c37a65.png)](https:\u002F\u002Fwww.star-history.com\u002F#mbzuai-oryx\u002FAwesome-LLM-Post-training&type=date&legend=top-left)\n\nLooking forward to your feedback, contributions, and stars! :star2:\nPlease raise any issues or questions [here](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FAwesome-LLM-Post-training\u002Fissues).
\n\n\n---\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_45d2297f2f63.png\" width=\"200\" height=\"100\">](https:\u002F\u002Fwww.ival-mbzuai.com)\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_f7ee9d1ef19f.png\" width=\"100\" height=\"100\">](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx)\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_5538daa7b5d2.png\" width=\"360\" height=\"85\">](https:\u002F\u002Fmbzuai.ac.ae)\n","# LLM 后训练：深入探究推理大型语言模型\n\n\n[![MIT 许可证](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-green.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2502.21321-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.21321)  [![维护中](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMaintained%3F-yes-green.svg)](https:\u002F\u002Fgithub.com\u002Fzzli2022\u002FSystem2-Reasoning-LLM)\n[![欢迎贡献](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FContributions-welcome-blue)]()\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FwaxVImv.png\" alt=\"Oryx Video-ChatGPT\">\n\u003C\u002Fp>\n\n欢迎来到 **Awesome-LLM-Post-training** 仓库！本仓库精心整理了与 **大型语言模型（LLMs）后训练方法** 相关的最具影响力的论文、代码实现、基准测试及各类资源。\n\n我们的工作基于以下论文：  \n📄 **LLM 后训练：深入探究推理大型语言模型** – 可在 [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2502.21321-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.21321) 上获取。\n\n#### [科马尔·库马尔](mailto:komal.kumar@mbzuai.ac.ae)* , [塔贾穆勒·阿什拉夫](https:\u002F\u002Fwww.tajamulashraf.com)* , [奥姆卡尔·塔瓦卡尔](https:\u002F\u002Fomkarthawakar.github.io\u002Findex.html) , [拉奥·穆罕默德·安韦尔](https:\u002F\u002Fmbzuai.ac.ae\u002Fstudy\u002Ffaculty\u002Frao-muhammad-anwer\u002F) , [希沙姆·乔拉卡尔](https:\u002F\u002Fmbzuai.ac.ae\u002Fstudy\u002Ffaculty\u002Fhisham-cholakkal\u002F) , [穆巴拉克·沙赫](https:\u002F\u002Fwww.crcv.ucf.edu\u002Fperson\u002Fmubarak-shah\u002F) , [杨明轩](https:\u002F\u002Fresearch.google\u002Fpeople\u002F105989\u002F) , [菲利普·H.S. 
托尔](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FPhilip_Torr) , [法哈德·沙巴兹·汗](https:\u002F\u002Fsites.google.com\u002Fview\u002Ffahadkhans\u002Fhome) ，以及 [萨尔曼·汗](https:\u002F\u002Fsalman-h-khan.github.io\u002F)  \n\\* 平等贡献的第一作者\n\n \n- **通讯作者:** [科马尔·库马尔](mailto:komal.kumar@mbzuai.ac.ae), [塔贾穆勒·阿什拉夫](https:\u002F\u002Fwww.tajamulashraf.com\u002F)。  \n\n欢迎 ⭐ 收藏并 fork 本仓库，以随时掌握最新进展，并为社区贡献力量。\n\n---\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_b6f634c7e538.jpg\" width=\"45%\" height=\"50%\" \u002F>\n\u003C!--   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_ad5d583b5c73.jpg\" width=\"45%\" height=\"50%\" \u002F> -->\n\u003C\u002Fp>\n针对 **LLMs** 的后训练方法分类体系，分为微调、强化学习和测试时缩放方法。我们总结了近期 LLM 模型中使用的关键技术。\n\n---\n\n## 📌 内容  \n\n| 版块 | 子版块 |  \n| ------- | ----------- |  \n| [📖 论文](#papers) | [综述](#survey), [理论](#theory), [可解释性](#explainability) |  \n| [🤖 LLMs 在 RL 中的应用](#LLMs-in-RL) | LLM 增强的强化学习 |  \n| [🏆 奖励学习](#reward-learning) | [人类反馈](#human-feedback), [基于偏好的 RL](#preference-based-rl), [内在动机](#intrinsic-motivation) |  \n| [🚀 策略优化](#policy-optimization) | [离线 RL](#offline-rl), [模仿学习](#imitation-learning), [层次化 RL](#hierarchical-rl) |  \n| [🧠 LLMs 用于推理与决策](#llms-for-reasoning-and-decision-making) | [因果推理](#causal-reasoning), [规划](#planning), [常识强化学习](#commonsense-rl) |  \n| [🌀 探索与泛化](#exploration-and-generalization) | [零样本 RL](#zero-shot-rl), [RL 中的泛化](#generalization-in-rl), [自监督 RL](#self-supervised-rl) |  \n| [🤝 多智能体 RL (MARL)](#multi-agent-rl-marl) | [涌现式通信](#emergent-communication), [协调](#coordination), [社交 RL](#social-rl) |  \n| [⚡ 应用与基准测试](#applications-and-benchmarks) | [自主代理](#autonomous-agents), [模拟](#simulations), [LLM-RL 基准测试](#llm-rl-benchmarks) |  \n| [📚 教程与课程](#tutorials-and-courses) | [讲座](#lectures), [研讨会](#workshops) |  \n| [🛠️ 库与实现](#libraries-and-implementations) | 开源 RL-LLM 框架 |  \n| [🔗 其他资源](#other-resources) | 补充研究与阅读材料 |  \n\n---\n\n# 📖 论文\n\n## 🔍 综述  \n\n| 标题 | 发表日期 | 链接 |\n|---------------------------------|----------------|---------------------------------|\n| VLM与合成数据桥梁作用的调查 | 2025年5月16日 | [OpenReview](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ThjDCZOljE) |\n| 大型语言模型后训练的调查 | 2025年3月8日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.06072) |\n| LLM后训练：深入探讨推理型大型语言模型 | 2025年2月28日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.21321) |\n| 从系统1到系统2：推理型大型语言模型的调查 | 2025年2月25日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17419) |\n| 以逻辑推理赋能LLM：全面调查 | 2025年2月24日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.15652)|\n| 迈向大型推理模型：基于大型语言模型的强化推理调查 | 2025年1月16日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.09686) |\n| 大型语言模型的危害性微调攻击与防御：调查   | 2024年9月26日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.18169) |\n| 使用大型语言模型进行推理，一项调查 | 2024年7月16日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11511) |\n| 大型语言模型增强型强化学习的调查：概念、分类与方法 | 2024年3月30日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.00282) |\n| 强化学习增强的LLM：一项调查 | 2024年12月5日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10400) |\n| 在代码生成中利用强化学习提升代码LLM：一项调查 | 2024年12月29日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.20367) |\n| 大型语言模型：其发展、能力与应用的调查 | 2025年1月15日 | [Springer](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs10115-024-02310-4) |\n| 多模态大型语言模型的调查 | 2025年2月10日 | [Oxford 
Academic](https:\u002F\u002Facademic.oup.com\u002Fnsr\u002Farticle\u002F11\u002F12\u002Fnwae403\u002F7896414) |\n| 大型语言模型（LLMs）：调查、技术框架及未来方向 | 2024年7月20日 | [Springer](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs10462-024-10888-y) |\n| 利用大型语言模型自动化并加速带有奖励机器的强化学习 | 2024年2月11日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.07069) |\n| ExploRLLM：利用大型语言模型引导强化学习中的探索 | 2024年3月14日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09583) |\n| 基于大型语言模型的强化学习问题解决 | 2024年4月29日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.18638) |\n| 针对强化学习的大型语言模型调查 | 2023年12月10日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04567) |\n| 大型语言模型作为决策者：一项调查 | 2023年8月23日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.11749) |\n| 大型语言模型对齐技术的调查 | 2023年5月6日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.00921) |\n| 带有人类反馈的强化学习：一项调查 | 2023年4月12日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.04989) |\n| 基于大型语言模型的推理：一项调查 | 2023年2月14日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.06476) |\n| 面向决策的基础模型调查 | 2023年1月9日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.04150) |\n| 大型语言模型在强化学习中的机遇与挑战 | 2022年12月5日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09142) |\n| 通过人类反馈训练语言模型遵循指令 | 2022年3月4日 | [Arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.02155) |\n\n\n\n---\n\n## 🤖 LLMs-in-RL  \n\n* Vision-Zero：通过战略游戏化自我博弈实现可扩展的VLM自我改进 [[论文]](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2509.25541) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.09-red)\n* Satori：行动-思维链增强的强化学习通过自回归搜索提升LLM推理能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.02508) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* DeepScaleR：通过扩展RL超越O1预览版，仅用15亿参数模型 [[论文]](https:\u002F\u002Fpretty-radio-b75.notion.site\u002FDeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNotion-2025.02-red)\n* QLASS：通过Q引导的逐步搜索提升语言智能体推理能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.02584) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 通过隐式奖励推进强化过程 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 通过强化学习和推理规模扩展推进语言模型推理能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.11651) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* 确保DeepSeek-R1模型AI安全的挑战：强化学习策略的不足 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17030) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* DeepSeek-R1：通过强化学习激励LLM的推理能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12948) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* Kimi k1.5：利用LLM扩展强化学习 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12599) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* RLHF能扩展吗？探索数据、模型和方法的影响 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06000) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* 面向LLM多步推理的离线强化学习 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16145) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* ReFT：面向语言模型的表示微调 [[论文]](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.410.pdf) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FACL-2024-blue)\n* Deepseekmath：突破开放语言模型的数学推理极限 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.03300) 
![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.02-red)\n* 通过强化功能标记调优进行推理 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13389) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 基于价值的深度强化学习可预测地扩展 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04327) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* InfAlign：面向推理的语言模型对齐 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.19792) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* LIMR：对于RL扩展而言，少即是多 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11886) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 关于大型语言模型数学领域基于反馈的多步推理的调查 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.143) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n \n\n---\n\n## 🏆 奖励学习（过程奖励模型）\n\n* PRMBench：面向过程级奖励模型的细粒度且具有挑战性的基准测试。[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.03124) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* ReARTeR：基于可信过程奖励的检索增强推理 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07861) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* 在数学推理中开发过程奖励模型的经验教训。[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07301) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* ToolComp：多工具推理与过程监督基准测试。[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01290) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* AutoPSV：自动化过程监督验证器 [[论文]](https:\u002F\u002Fopenreview.net\u002Fforum?id=eOAPWWOGs9) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS-2024-blue)\n* ReST-MCTS*：通过过程奖励引导的树搜索进行大语言模型自训练 [[论文]](https:\u002F\u002Fopenreview.net\u002Fforum?id=8rcFOqEud5) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS-2024-blue)\n* 无需过程标签的自由过程奖励。[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.01981) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* 针对代码生成的结果精炼型过程监督 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15118) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* Math-Shepherd：无需人工标注，逐步验证并强化大语言模型 [[论文]](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.510\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FACL-2024-blue)\n* OVM：用于数学推理规划的结果监督价值模型 [[论文]](https:\u002F\u002Faclanthology.org\u002F2024.findings-naacl.55\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FACL_Findings-2024-blue)\n* Step-DPO：针对大语言模型长链推理的分步偏好优化 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.18629) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.06-red)\n* 让我们逐步验证。[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.20050) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.05-red)\n* 通过自动化过程监督提升语言模型的数学推理能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05372) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2023.06-red)\n* 利用步骤感知验证器使大型语言模型成为更好的推理者 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.02336) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2023.06-red)\n* 利用过程与结果反馈解决数学应用题 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.14275) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2022.11-red)\n* 基于生成式奖励模型的不确定性感知分步验证 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11250) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* AdaptiveStep：根据模型置信度自动划分推理步骤 
[[论文]](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2502.13943) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 内部奖励模型的自洽性可提升自我奖励语言模型 [[论文]](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2502.08922) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 10亿参数的语言模型能否超越4050亿参数的语言模型？重新思考计算最优的推理时缩放策略 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06703) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 代理式奖励建模：将人类偏好与可验证的正确性信号相结合，构建可靠的奖励系统 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19328) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 凭直觉指导：利用强化的内在信心实现高效的推理时缩放 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20325) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.05-red)\n* 大语言模型与金融的结合：为开放FinLLM排行榜微调基础模型 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.13125) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.04-red)\n---\n\n## 策略优化\n\n* 挤干湿海绵：大型语言模型的高效离策略强化微调 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06892) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.07-red)\n\n---\n\n##  MCTS\u002F树搜索\n* 关于马尔可夫决策过程中最优值估计的MCTS收敛速率 [[论文]](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F10870057\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FIEEE_TAC-2025-blue)\n* Search-o1: 基于智能体搜索增强的大规模推理模型 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.05366) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* rStar-Math: 小型LLM可通过自我进化式深度思考掌握数学推理能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04519) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* ReST-MCTS*: 基于过程奖励引导的树搜索实现LLM自训练 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.03816) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* 思维森林：通过增加测试时计算量来提升LLM推理能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.09078) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* 华佗GPT-o1，迈向基于LLM的医学复杂推理 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18925) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* 桑葚：通过集体蒙特卡洛树搜索赋予多模态LLM类似o1的推理与反思能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18319) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* 利用引导式树搜索提出并解决奥数几何问题 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10673) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* SPaR：结合树搜索精炼的自我博弈方法，用于提升大型语言模型的指令遵循能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.11605) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* 基于迭代偏好学习的蒙特卡洛树搜索增强推理中的内在自我修正能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17397) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* CodeTree：大型语言模型辅助代码生成的代理引导树搜索 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04329) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* GPT引导的蒙特卡洛树搜索在金融欺诈检测中的符号回归应用 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04459) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* MC-NEST——利用蒙特卡洛纳什均衡自我精炼树提升大型语言模型的数学推理能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15645) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* Marco-o1：迈向开放式解决方案的开放性推理模型 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14405) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* 
SRA-MCTS：利用蒙特卡洛树搜索进行代码生成的自我驱动式推理增强 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11053) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* 别丢掉你的价值模型！通过价值引导的蒙特卡洛树搜索解码生成更优文本 [[论文]](https:\u002F\u002Fopenreview.net\u002Fforum?id=kh9Zt2Ldmn#discussion) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCoLM-2024-blue)\n* AFlow：自动化智能体工作流生成 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10762) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* 可解释的对比式蒙特卡洛树搜索推理 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01707) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* LLaMA-Berry：针对O1级别奥数数学推理的成对优化 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.02884) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* 通过MCTS实现LLM自我改进：利用循序渐进的知识与课程式偏好学习 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06508) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* TreeBoN：通过推测式树搜索和最佳N次采样提升推理时的一致性 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16033) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* 理解思维树何时奏效：更大规模的模型在生成方面表现更佳，而非在判别方面 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17820) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* RethinkMCTS：为代码生成优化蒙特卡洛树搜索中的错误思路 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.09584) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.09-red)\n* 策略家：LLM通过双层树搜索学习战略技能 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.10635) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.08-red)\n* LiteSearch：高效的LLM树搜索 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.00320) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.07-red)\n* 针对语言模型代理的树搜索 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.01476) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.07-red)\n* 大型语言模型搜索树上的不确定性引导优化 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.03951) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.07-red)\n* 通过蒙特卡洛树自我精炼，借助LLaMa-3 8B获得GPT-4级别的数学奥赛解法 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07394) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.06-red)\n* 超越A*：通过搜索动力学自举实现更好的Transformer规划 [[论文]](https:\u002F\u002Fopenreview.net\u002Fforum?id=rviGTsl0oy) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICLR_WorkShop-2024-blue)\n* LLM推理者：大型语言模型逐步推理的新评估、库及分析 [[论文]](https:\u002F\u002Fopenreview.net\u002Fforum?id=h1mvwbQiXR) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICLR_WorkShop-2024-blue)\n* AlphaMath几乎为零：无需流程监督的过程监督 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.03553) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.05-red)\n* 利用蒙特卡洛树搜索引导大型语言模型生成代码世界模型 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.15383) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.05-red)\n* MindStar：在推理时提升预训练LLM的数学推理能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16265) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.05-red)\n* 通过迭代偏好学习提升蒙特卡洛树搜索的推理能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.00451) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.05-red)\n* 搜索之流（SoS）：学习如何在语言中进行搜索 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03683) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.04-red)\n* 通过想象、搜索和批判实现LLM自我改进 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.12253) 
![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.04-red)\n* 思维的不确定性：不确定性感知规划增强了大型语言模型的信息获取能力 [[论文]](https:\u002F\u002Fopenreview.net\u002Fforum?id=CVpuVe1N22&noteId=aTI8PGpO47) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS-2024-blue)\n* 使用语言模型进行推理即是在使用世界模型进行规划 [[论文]](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.507\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FEMNLP-2023-blue)\n* 大型语言模型作为大规模任务规划中的常识知识 [[论文]](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS-2023-blue)\n* 类似ALPHAZERO的树搜索可以指导大型语言模型的解码和训练 [[论文]](https:\u002F\u002Fopenreview.net\u002Fforum?id=PJfc4x2jXY) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS_WorkShop-2023-blue)\n* 让PPO变得更好：价值引导的蒙特卡洛树搜索解码 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15028) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2023.09-red)\n* 利用受限蒙特卡洛树搜索生成可靠的长链式数学推理 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11169) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 基于假设的心智理论推理应用于大型语言模型 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11881) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 细粒度会话解码：基于各向同性和近端搜索 [[论文]](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.5\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FEMNLP-2023-blue)\n* 控制-DAG：使用加权有限状态自动机实现非自回归定向无环T5的约束解码 [[论文]](https:\u002F\u002Faclanthology.org\u002F2024.naacl-short.42\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNAACL-2024-blue)\n* 回溯解码用于开放式文本生成 [[论文]](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.66\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FEMNLP-2023-blue)\n* LeanProgress：通过证明进度预测引导神经定理证明的搜索 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17925) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n\n---\n\n\n\n## 可解释性\n* 快速思考与慢速思考的智能体：讲者-推理者架构 [[论文]](https:\u002F\u002Fopenreview.net\u002Fforum?id=xPhcP6rbI4) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS_WorkShop-2024-blue)\n* 在针对快速与慢速思考进行训练时，大语言模型各层发生了什么？——基于梯度的视角 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23743) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* 当语言模型被优化用于推理时，它是否仍会表现出自回归的痕迹？对OpenAI o1的分析 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01792) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.10-red)\n* 推理步骤长度对大型语言模型的影响 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.04925) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.08-red)\n* 将系统2提炼为系统1 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.06023) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.07-red)\n* 系统2注意力机制（你可能也需要） [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.11829) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2023.11-red)\n* 朝着大语言模型中的系统2推理迈进：通过元思维链学习如何思考 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04682) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* LlamaV-o1：重新思考大语言模型中的分步视觉推理 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.06186) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* 
两个脑袋胜过一个：推理时的双模型言语反思 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19230) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 探索迭代增强以改进由学习者提供的多选题解释 [[论文]](http:\u002F\u002Farxiv.org\u002Fabs\u002F2309.10444) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAAAI\u002FEAAI-2025-blue)\n* 演绎规则：训练Transformer模型以解释意外输入 [[论文]](https:\u002F\u002Faclanthology.org\u002F2022.findings-acl.19\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FACL_Findings-2022-blue)\n## 多模态智能体相关的快慢系统\n* 深入研究多模态推理的自我进化式训练 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17451) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* 视觉智能体作为快速与慢速思考者 [[论文]](https:\u002F\u002Fopenreview.net\u002Fforum?id=ncCuiD3KJQ) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICLR-2025-blue)\n* Virgo：关于重现o1-like MLLM的初步探索 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01904) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* 利用视觉价值模型扩展推理时搜索范围，以提升视觉理解能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03704) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* 慢速感知：让我们逐步感知几何图形 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.20631) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* AtomThink：面向多模态数学推理的慢速思考框架 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11930) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* LLaVA-o1：让视觉语言模型能够分步推理 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10440) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* 视觉-语言模型可通过反思实现推理的自我改进 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00855) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* 我思故我扩散：在扩散模型中实现多模态上下文推理 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10458) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* RAG-Gym：通过过程监督优化推理与搜索智能体 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13957) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n\n## 基准测试与数据集\n* PhyX：你的模型具备物理推理的“智慧”吗？[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15929) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.05-red)\n* Big-Math：面向语言模型强化学习的大规模高质量数学数据集[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17387) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* PRMBench：面向过程级奖励模型的细粒度且具有挑战性的基准测试[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.03124) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* MR-Ben：用于评估大语言模型中系统2思维的元推理基准测试[[论文]](https:\u002F\u002Fopenreview.net\u002Fforum?id=GN2qbxZlni) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS-2024-blue)\n* 对于2+3=?，别想太多：关于o1类大语言模型过度思考的问题[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.21187) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* o1在医学领域的初步研究：我们离人工智能医生更近了吗？[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.15277) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.09-red)\n* EquiBench：通过等价性检查来评估大型语言模型的代码推理能力[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12466) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* SuperGPQA：跨285个研究生学科扩展大语言模型评估[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14739) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 多模态RewardBench：对视觉语言模型奖励模型的全面评估[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14191) 
![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* FrontierMath：评估人工智能高级数学推理能力的基准测试[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04872) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* OpenAI o1评估：AGI的机遇与挑战[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.18486) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.09-red)\n* MATH-Perturb：针对困难扰动项的大语言模型数学推理能力基准测试[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06453) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* LongReason：通过上下文扩展构建的合成长上下文推理基准测试[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.15089) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* 人类的最后一场考试[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14249) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* LR\u003Csup>2\u003C\u002Fsup>Bench：通过约束满足问题评估大型语言模型的长链式反思推理能力[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17848) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* BIG-Bench 极难版[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19187) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 利用任务结构变化评估和增强大型语言模型逻辑推理的鲁棒性[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.09430) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICONIP-2024-blue)\n* 自然语言上的多步演绎推理：关于分布外泛化能力的实证研究[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.14000) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeSy-2022-blue)\n* 大型语言模型并非强大的抽象推理者[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.19555) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FIJCAI-2024-blue)\n\n## 推理与安全\n* 安全税：安全对齐使你的大型推理模型变得不那么合理[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00555v1) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.03-red)\n* 过度思考：针对推理型大语言模型的减速攻击[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.02542) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* GuardReasoner：迈向基于推理的大语言模型安全保障[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18492) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICLR_WorkShop-2025-blue)\n* SafeChain：具备长链式思维推理能力的语言模型安全性[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12025) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* ThinkGuard：深思熟虑的慢速思考带来谨慎的安全护栏[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13458) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* H-CoT：劫持链式思维安全推理机制以越狱大型推理模型，包括OpenAI o1\u002Fo3、DeepSeek-R1和Gemini 2.0 Flash Thinking[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12893) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* BoT：通过后门攻击破坏o1类大型语言模型的长思维过程[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12202) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 基于抽象语义表示的逻辑驱动数据增强用于逻辑推理[[论文]](https:\u002F\u002Faclanthology.org\u002F2024.findings-acl.353\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FACL_Findings-2024-blue)\n* ChatLogic：将逻辑编程与大型语言模型结合以实现多步推理[[论文]](https:\u002F\u002Fopenreview.net\u002Fforum?id=AOqGF7Po7Z) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAAAI_WorkShop-2024-blue)\n\n---\n## 🚀 强化学习与大语言模型微调仓库\n\n| #  | 仓库与链接          
                 | 描述                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |\n|----|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| 1  | [**RL4VLM**](https:\u002F\u002Fgithub.com\u002FRL4VLM\u002FRL4VLM) \u003Cbr>\u003Cbr> _自2024年12月15日起已归档且只读_       | 提供通过强化学习微调大型视觉-语言模型作为决策代理的代码。包含使用特定任务奖励训练模型并在各种环境中进行评估的实现。                                                                                                                                                                                                                                                                                                         |\n| 2  | [**LlamaGym**](https:\u002F\u002Fgithub.com\u002FKhoomeiK\u002FLlamaGym)                                                      | 简化了使用在线强化学习微调大型语言模型（LLM）代理的过程。提供一个抽象的 `Agent` 类来处理强化学习训练的各个方面，从而能够在不同环境中快速迭代和实验。                                                                                                                                                                                                                                                                                      |\n| 3  | [**基于强化学习的扩散模型在生物序列上的微调**](https:\u002F\u002Fgithub.com\u002Fmasa-ue\u002FRLfinetuning_Diffusion_Bioseq) | 配合一篇关于基于强化学习微调的教程和综述论文，重点在于生物序列（DNA\u002FRNA）的设计。提供了使用强化学习训练和微调扩散模型的全面教程和代码实现。                                                                                                                                                                                                                                     |\n| 4  | [**LM-RL-Finetune**](https:\u002F\u002Fgithub.com\u002Fzhixuan-lin\u002FLM-RL-finetune)                                       | 旨在通过解析计算KL惩罚项来优化语言模型强化学习微调中的KL惩罚优化问题。包含使用近端策略优化（PPO）进行训练的配置。                                                                                                                                                                                                                                                                                                                     |\n| 5  | [**InstructLLaMA**](https:\u002F\u002Fgithub.com\u002Fmichaelnny\u002FInstructLLaMA)                                           | 实现了预训练、监督微调（SFT）以及基于人类反馈的强化学习（RLHF），以训练和微调LLaMA2模型，使其能够遵循人类指令，类似于InstructGPT或ChatGPT。                                                                                                                                                                                                                                                                                    
                   |\n| 6  | [**SEIKO**](https:\u002F\u002Fgithub.com\u002Fzhaoyl18\u002FSEIKO)                                                             | 介绍了一种新颖的强化学习方法，可在在线设置中高效微调扩散模型。其技术在微调Stable Diffusion方面优于PPO、基于分类器的引导以及直接奖励反向传播等基线方法。                                                                                                                                                                                                                                                                              |\n| 7  | [**TRL（用强化学习训练Transformer语言模型）**](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl)                  | 一个最先进的库，用于使用监督微调（SFT）、近端策略优化（PPO）、GRPO和直接偏好优化（DPO）等方法对基础模型进行后训练。构建于🤗 Transformers生态系统之上，支持多种模型架构，并能高效地扩展到不同的硬件配置。                                                                                                                                                                                     |\n| 8  | [**将强化学习模型的微调视为持续学习**](https:\u002F\u002Fgithub.com\u002FBartekCupial\u002Ffinetuning-RL-as-CL) | 探讨将强化学习模型的微调视为遗忘缓解问题（持续学习）。提供了应对强化学习模型遗忘现象的见解和代码实现。                                                                                                                                                                                                                                                                                                                                                        |\n| 9  | [**RL4LMs**](https:\u002F\u002Fgithub.com\u002Fallenai\u002FRL4LMs)                                                            | 一个模块化的强化学习库，用于将语言模型微调至符合人类偏好。通过GRUE基准测试进行了2000多次严格评估，确保其在各类NLP任务中的鲁棒性。                                                                                                                                                                                                                                                                                                                             |\n| 10 | [**Lamorel**](https:\u002F\u002Fgithub.com\u002Fflowersteam\u002Flamorel)                                                      | 一种高吞吐量、分布式架构，用于在交互式环境中无缝集成LLM。虽然默认情况下不专门针对强化学习或RLHF，但它支持自定义实现，非常适合需要最大灵活性的用户。                                                                                                                     |\n| 11 | [**LLM-Reverse-Curriculum-RL**](https:\u002F\u002Fgithub.com\u002FWooooDyy\u002FLLM-Reverse-Curriculum-RL)                     | 实现了ICML 2024论文《通过逆向课程强化学习训练大型语言模型以进行推理》。专注于使用逆向课程强化学习方法提升LLM的推理能力。                                                                                                                                                                                                                                                  |\n| 12 | [**veRL**](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl)                                                             | 一个灵活、高效且可用于生产的强化学习训练库，专为大型语言模型（LLMs）设计。它是HybridFlow框架的开源实现，支持多种强化学习算法（PPO、GRPO），具备先进的资源利用率，并可扩展至数百张GPU上的70B参数模型。与Hugging Face模型、监督微调以及RLHF结合，支持多种奖励类型。                                                  |\n| 13 | [**trlX**](https:\u002F\u002Fgithub.com\u002FCarperAI\u002Ftrlx)                                                               | 一个用于通过强化学习微调大型语言模型（LLMs）的分布式训练框架。同时支持Accelerate和NVIDIA NeMo后端，允许训练高达20B+参数的模型。实现了PPO和ILQL，并与CHEESE集成以进行人机协作的数据收集。                                                                                                                                                                                              |\n| 14 | 
[**Okapi**](https:\u002F\u002Fgithub.com\u002Fnlp-uoregon\u002FOkapi)            | 一个用于LLMs中基于RLHF的指令微调框架，支持26种语言。提供多语言资源，如ChatGPT提示词、指令数据集和响应排序数据，以及基于BLOOM和LLaMa的模型和评估基准。            |\n| 15 | [**LLaMA-Factory**](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory)            | *统一高效的100+ LLMs和VLMs微调（ACL 2024）*。支持广泛的模型（如LLaMA、LLaVA、Qwen、Mistral），采用的方法包括预训练、多模态微调、奖励建模、PPO、DPO和ORPO。提供可扩展的微调方式（16位、LoRA、QLoRA），并配有高级优化和日志集成；同时通过API、Gradio UI和CLI配合vLLM工作进程实现快速推理。            |\n\n上述框架大多围绕 PPO、DPO、GRPO 等后训练目标构建；文末附有一个最小的 DPO 损失示例草图以作说明。\n\n---\n\n\n## ⚡ 应用与基准测试  \n\n* AutoGPT：用于自主强化学习智能体的大型语言模型 - OpenAI（2023）[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.03442)\n* Barkour：评估LLM增强型强化学习的基准测试 - Wu等人（2023）[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.12377)\n* Colon-X：从多模态理解到临床推理，推动智能结肠镜检查的进步（医学理解与推理）[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.03667) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.12-red) [[代码]](https:\u002F\u002Fgithub.com\u002Fai4colonoscopy\u002FColon-X)\n* Big-Math：面向语言模型强化学习的大规模高质量数学数据集 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17387) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* PRMBench：面向过程级奖励模型的细粒度且具有挑战性的基准测试 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.03124) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* MR-Ben：用于评估大型语言模型中系统2思维的元推理基准测试 [[论文]](https:\u002F\u002Fopenreview.net\u002Fforum?id=GN2qbxZlni) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS-2024-blue)\n* 对于2+3=?不要想太多：关于o1类大型语言模型过度思考的问题 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.21187) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.12-red)\n* o1在医学领域的初步研究：我们离人工智能医生更近了吗？[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.15277) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.09-red)\n* EquiBench：通过等价性检查评估大型语言模型的代码推理能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12466) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* SuperGPQA：跨285个研究生学科扩展大型语言模型评估 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14739) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* 多模态RewardBench：视觉语言模型奖励模型的整体评估 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14191) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* FrontierMath：评估人工智能高级数学推理能力的基准测试 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04872) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.11-red)\n* OpenAI o1评估：AGI的机遇与挑战 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.18486) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2024.09-red)\n* MATH-Perturb：针对高难度扰动评估大型语言模型数学推理能力的基准测试 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06453) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* LongReason：通过上下文扩展构建的合成长上下文推理基准测试 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.15089) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* 人类的最后一场考试 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14249) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.01-red)\n* 
LR\u003Csup>2\u003C\u002Fsup>Bench：通过约束满足问题评估大型语言模型的长链反思推理能力 [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17848) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n* BIG-Bench Extra Hard [[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19187) ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.02-red)\n\n---\n\n## 📚 教程与课程  \n\n- 🎥 **深度强化学习训练营（伯克利）** [[官网](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdeep-rl-bootcamp\u002F)]  \n- 🎥 **DeepMind强化学习系列** [[官网](https:\u002F\u002Fdeepmind.com\u002Flearning-resources)]  \n\n---\n\n## 🛠️ 库与实现  \n\n- 🔹 [决策 Transformer（GitHub）](https:\u002F\u002Fgithub.com\u002Fkzl\u002Fdecision-transformer)  \n- 🔹 [ReAct（GitHub）](https:\u002F\u002Fgithub.com\u002Fysymyth\u002FReAct)  \n- 🔹 [RLHF（GitHub）](https:\u002F\u002Fgithub.com\u002Fopenai\u002Flm-human-preferences)  \n\n---\n\n## 🔗 其他资源  \n\n- [2023年NeurIPS大会上的LLM与RL研讨会](https:\u002F\u002Fneurips.cc)  \n- [OpenAI关于RLHF的研究博客](https:\u002F\u002Fopenai.com\u002Fresearch)  \n\n---\n\n## 📌 贡献说明  \n\n欢迎各位贡献！如果您有相关的论文、代码或见解，请随时提交拉取请求。  \n\n[![星标历史图](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_2d4739b2a7ab.png)](https:\u002F\u002Fwww.star-history.com\u002F#mbzuai-oryx\u002FAwesome-LLM-Post-training&Timeline)\n\n## 引用\n\n如果您觉得我们的工作有用，或在您的研究中使用了它，请考虑引用（引用条目保留英文原文，以便检索）：\n\n```bibtex\n@misc{kumar2025llmposttrainingdeepdive,\n      title={LLM Post-Training: A Deep Dive into Reasoning Large Language Models}, \n      author={Komal Kumar and Tajamul Ashraf and Omkar Thawakar and Rao Muhammad Anwer and Hisham Cholakkal and Mubarak Shah and Ming-Hsuan Yang and Philip H. S. Torr and Fahad Shahbaz Khan and Salman Khan},\n      year={2025},\n      eprint={2502.21321},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.21321}, \n}\n```\n\n## 许可证 :scroll:\n\u003Ca rel=\"license\" href=\"http:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc-sa\u002F4.0\u002F\">\u003Cimg alt=\"知识共享许可\" style=\"border-width:0\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_8a4e76cf0ed2.png\" \u002F>\u003C\u002Fa>\u003Cbr \u002F>本作品采用\u003Ca rel=\"license\" href=\"http:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc-sa\u002F4.0\u002F\">知识共享署名-非商业性使用-相同方式共享4.0国际许可协议\u003C\u002Fa>授权。\n\n## 星标历史图\n[![星标历史图](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_cc73a0c37a65.png)](https:\u002F\u002Fwww.star-history.com\u002F#mbzuai-oryx\u002FAwesome-LLM-Post-training&type=date&legend=top-left)\n\n期待您的反馈、贡献和星标！ :star2:\n如有任何问题或建议，请在[此处](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FAwesome-LLM-Post-training\u002Fissues)提出。
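\n\n---\n\n## 🧪 附录：最小 DPO 损失草图\n\n正如上文仓库表格所提到的，TRL、veRL、LLaMA-Factory 等框架都实现了 DPO（直接偏好优化）目标。下面仅是一个示意性草图，并非上述任何仓库的官方实现：在冻结的参考模型下，扩大“被偏好回答”相对“被拒绝回答”的对数概率间隔。所有函数名与变量名均为假设性示例。\n\n```python\nimport torch\nimport torch.nn.functional as F\n\ndef dpo_loss(policy_chosen_logps, policy_rejected_logps,\n             ref_chosen_logps, ref_rejected_logps, beta=0.1):\n    # 各输入均为回答在对应模型下的对数概率之和，形状为 (batch,)\n    chosen_margin = policy_chosen_logps - ref_chosen_logps\n    rejected_margin = policy_rejected_logps - ref_rejected_logps\n    # -log(sigmoid(x)) 与 softplus(-x) 等价，数值上更稳定\n    return F.softplus(-beta * (chosen_margin - rejected_margin)).mean()\n\n# 随机张量的玩具用例\nprint(dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)))\n```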
\n\n\n---\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_45d2297f2f63.png\" width=\"200\" height=\"100\">](https:\u002F\u002Fwww.ival-mbzuai.com)\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_f7ee9d1ef19f.png\" width=\"100\" height=\"100\">](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx)\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_readme_5538daa7b5d2.png\" width=\"360\" height=\"85\">](https:\u002F\u002Fmbzuai.ac.ae)","# Awesome-LLM-Post-training 快速上手指南\n\n**Awesome-LLM-Post-training** 并非一个可直接安装的单一软件库，而是一个精选的**资源知识库**。它汇集了关于大语言模型（LLM）后训练（Post-Training）、推理增强、强化学习（RL）及奖励建模等领域的顶尖论文、代码实现、基准测试和教程。\n\n本指南将帮助开发者快速利用该仓库获取核心资源，并搭建相关的实验环境。\n\n## 📌 环境准备\n\n由于本仓库主要提供论文链接和对应开源项目的索引，您需要根据具体想复现的论文或模型来配置环境。以下是通用的推荐基础环境：\n\n### 系统要求\n*   **操作系统**: Linux (推荐 Ubuntu 20.04\u002F22.04) 或 macOS\n*   **硬件**: NVIDIA GPU (建议显存 ≥ 24GB，用于运行 7B+ 参数模型的微调或推理)\n*   **CUDA**: 版本 ≥ 11.8 (取决于具体框架需求)\n\n### 前置依赖\n大多数列出的项目基于以下主流深度学习框架：\n*   **Python**: 3.9 - 3.11\n*   **PyTorch**: 2.0+\n*   **Transformers**: 4.30+\n*   **Accelerate \u002F DeepSpeed**: 用于分布式训练\n*   **vLLM \u002F TGI**: 用于高效推理\n\n> 💡 **国内加速建议**：\n> 在安装 Python 依赖时，推荐使用清华或阿里镜像源以提升下载速度：\n> ```bash\n> pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 🚀 安装步骤\n\n由于这是一个资源列表仓库，您只需克隆该仓库以获取最新的论文清单和代码链接。\n\n### 1. 克隆仓库\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FAwesome-LLM-Post-training.git\ncd Awesome-LLM-Post-training\n```\n\n### 2. 获取特定项目代码\n浏览仓库中的 `README.md` 或分类列表（如 `LLMs-in-RL`, `Reward Learning`），找到您感兴趣的项目（例如 *DeepSeek-R1*, *ReFT*, *Kimi k1.5* 等）。\n\n点击对应的 `[Paper]` 或代码链接跳转到原始项目页面。通常安装方式如下（以典型的 RLHF 项目为例）：\n\n```bash\n# 示例：假设您选择了某个具体的开源实现项目\ngit clone \u003C目标项目仓库地址>\ncd \u003C目标项目目录>\n\n# 创建虚拟环境\npython -m venv venv\nsource venv\u002Fbin\u002Factivate\n\n# 安装依赖 (优先使用国内镜像)\npip install -e . -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 🛠️ 基本使用\n\n本仓库的核心用法是**作为技术选型和文献检索的入口**。以下是典型的使用流程：\n\n### 场景：寻找并复现一个“过程奖励模型 (Process Reward Model)”\n\n1.  **检索资源**：\n    在克隆后的 `README.md` 中查找 `🏆 Reward Learning` 章节。\n    找到相关论文，例如：*ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search。\n\n2.  **定位代码**：\n    点击该条目对应的链接（通常会指向 arXiv 论文或 GitHub 代码库）。\n\n3.  
**运行示例**：\n    进入该项目的主页后，参照其提供的 Quick Start 进行推理或训练。以下是一个通用的推理命令示例（具体参数需参考对应项目文档）：\n\n    ```python\n    # 示例：使用 HuggingFace Transformers 加载经过后训练的模型\n    from transformers import AutoModelForCausalLM, AutoTokenizer\n\n    model_name = \"path\u002Fto\u002Fdownloaded\u002Fmodel\" # 替换为具体模型路径\n    tokenizer = AutoTokenizer.from_pretrained(model_name)\n    model = AutoModelForCausalLM.from_pretrained(model_name, device_map=\"auto\")\n\n    input_text = \"Solve this math problem step by step: 1 + 1 = ?\"\n    inputs = tokenizer(input_text, return_tensors=\"pt\").to(model.device)\n\n    outputs = model.generate(**inputs, max_new_tokens=512)\n    print(tokenizer.decode(outputs[0], skip_special_tokens=True))\n    ```\n\n### 核心分类速查\n根据您的研究需求，可直接跳转至 README 中的以下板块：\n*   **📖 Papers\u002FSurvey**: 了解后训练领域的最新综述和理论进展。\n*   **🤖 LLMs in RL**: 查找结合强化学习的推理增强方法（如 DeepScaleR, Satori）。\n*   **🏆 Reward Learning**: 获取过程奖励模型（PRM）和人类反馈（RLHF）的相关实现。\n*   **🛠️ Libraries**: 寻找开源的 RL-LLM 训练框架。\n\n---\n*注：本仓库持续更新，建议定期 `git pull` 获取最新的论文列表和资源链接。*","某金融科技公司的算法团队正致力于将通用大模型改造为具备复杂逻辑推理能力的智能投顾助手，以处理高风险的投资决策分析。\n\n### 没有 Awesome-LLM-Post-training 时\n- **技术路线迷茫**：面对微调、强化学习（RL）和测试时扩展等多种后训练范式，团队难以快速厘清哪种方案最适合金融推理场景，导致选型周期长达数周。\n- **资源检索低效**：研究人员需分散在 arXiv、GitHub 和各大学术博客中手动筛选论文与代码，极易遗漏如“基于偏好的 RL”或“因果推理”等关键前沿成果。\n- **复现门槛过高**：缺乏系统化的开源框架指引，团队在尝试复现 SOTA（最先进）模型时，常因缺少标准的奖励学习或策略优化实现而反复踩坑。\n- **理论实践脱节**：资深工程师难以将抽象的“内在动机”或“分层 RL”理论直接映射到具体的代码库，导致算法落地进度严重滞后。\n\n### 使用 Awesome-LLM-Post-training 后\n- **路径清晰明确**：借助其清晰的分类体系（微调\u002FRL\u002F测试时扩展），团队迅速锁定了“偏好强化学习”作为核心优化方向，将技术选型时间缩短至 2 天。\n- **一站式资源聚合**：直接获取精选的最新论文、基准测试及代码实现，特别是“奖励学习”与“自主智能体”板块，让团队无缝对接了行业最新进展。\n- **工程落地加速**：利用收录的开源 RL-LLM 框架库，团队快速搭建了包含人类反馈（Human Feedback）的训练流水线，显著降低了复现成本。\n- **知行合一**：通过关联的教程与课程链接，团队成员迅速理解了“因果推理”在决策中的具体应用，并成功将其转化为可运行的代码模块。\n\nAwesome-LLM-Post-training 通过构建从理论综述到代码实现的完整闭环，将大模型后训练的探索效率提升了数倍，让研发团队能专注于核心业务逻辑而非基础基建。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_Awesome-LLM-Post-training_03ca6403.png","mbzuai-oryx","ORYX","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmbzuai-oryx_e1ef1b3c.jpg","A Library for Large Vision-Language Models",null,"https:\u002F\u002Fival-mbzuai.com","https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx",[22],{"name":23,"color":24,"percentage":25},"Python","#3572A5",100,2370,157,"2026-04-19T09:53:10",1,"","未说明",{"notes":33,"python":31,"dependencies":34},"该仓库（Awesome-LLM-Post-training）是一个 curated collection（精选合集），主要收录了关于大语言模型后训练方法（如微调、强化学习、测试时扩展等）的论文、代码实现、基准测试和资源列表。它本身不是一个可直接运行的单一软件工具或框架，因此 README 中未提供具体的操作系统、GPU、内存、Python 版本或依赖库的安装需求。用户需根据列表中引用的具体论文或子项目（如 DeepSeek-R1, ReFT 等）的独立仓库来获取相应的运行环境配置。",[],[36,37],"开发框架","语言模型",[39,40,41,42,43,44],"fine","large-language-models","post-training","reasoning","reinforcement-learning","scaling",2,"ready","2026-03-27T02:49:30.150509","2026-04-20T19:42:11.403192",[],[],[52,64,73,81,89,97],{"id":53,"name":54,"github_repo":55,"description_zh":56,"stars":57,"difficulty_score":58,"last_commit_at":59,"category_tags":60,"status":46},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows 
WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[61,36,62,63],"Agent","图像","数据工具",{"id":65,"name":66,"github_repo":67,"description_zh":68,"stars":69,"difficulty_score":45,"last_commit_at":70,"category_tags":71,"status":46},9989,"n8n","n8n-io\u002Fn8n","n8n 是一款面向技术团队的公平代码（fair-code）工作流自动化平台，旨在让用户在享受低代码快速构建便利的同时，保留编写自定义代码的灵活性。它主要解决了传统自动化工具要么过于封闭难以扩展、要么完全依赖手写代码效率低下的痛点，帮助用户轻松连接 400 多种应用与服务，实现复杂业务流程的自动化。\n\nn8n 特别适合开发者、工程师以及具备一定技术背景的业务人员使用。其核心亮点在于“按需编码”：既可以通过直观的可视化界面拖拽节点搭建流程，也能随时插入 JavaScript 或 Python 代码、调用 npm 包来处理复杂逻辑。此外，n8n 原生集成了基于 LangChain 的 AI 能力，支持用户利用自有数据和模型构建智能体工作流。在部署方面，n8n 提供极高的自由度，支持完全自托管以保障数据隐私和控制权，也提供云端服务选项。凭借活跃的社区生态和数百个现成模板，n8n 让构建强大且可控的自动化系统变得简单高效。",184740,"2026-04-19T23:22:26",[63,36,61,62,72],"插件",{"id":74,"name":75,"github_repo":76,"description_zh":77,"stars":78,"difficulty_score":58,"last_commit_at":79,"category_tags":80,"status":46},10095,"AutoGPT","Significant-Gravitas\u002FAutoGPT","AutoGPT 是一个旨在让每个人都能轻松使用和构建 AI 的强大平台，核心功能是帮助用户创建、部署和管理能够自动执行复杂任务的连续型 AI 智能体。它解决了传统 AI 应用中需要频繁人工干预、难以自动化长流程工作的痛点，让用户只需设定目标，AI 即可自主规划步骤、调用工具并持续运行直至完成任务。\n\n无论是开发者、研究人员，还是希望提升工作效率的普通用户，都能从 AutoGPT 中受益。开发者可利用其低代码界面快速定制专属智能体；研究人员能基于开源架构探索多智能体协作机制；而非技术背景用户也可直接选用预置的智能体模板，立即投入实际工作场景。\n\nAutoGPT 的技术亮点在于其模块化“积木式”工作流设计——用户通过连接功能块即可构建复杂逻辑，每个块负责单一动作，灵活且易于调试。同时，平台支持本地自托管与云端部署两种模式，兼顾数据隐私与使用便捷性。配合完善的文档和一键安装脚本，即使是初次接触的用户也能在几分钟内启动自己的第一个 AI 智能体。AutoGPT 正致力于降低 AI 应用门槛，让人人都能成为 AI 的创造者与受益者。",183572,"2026-04-20T04:47:55",[61,37,72,36,62],{"id":82,"name":83,"github_repo":84,"description_zh":85,"stars":86,"difficulty_score":58,"last_commit_at":87,"category_tags":88,"status":46},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[36,62,61],{"id":90,"name":91,"github_repo":92,"description_zh":93,"stars":94,"difficulty_score":45,"last_commit_at":95,"category_tags":96,"status":46},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",161692,"2026-04-20T11:33:57",[36,61,37],{"id":98,"name":99,"github_repo":100,"description_zh":101,"stars":102,"difficulty_score":45,"last_commit_at":103,"category_tags":104,"status":46},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 
都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[36,62,61]]