[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-andri27-ts--Reinforcement-Learning":3,"tool-andri27-ts--Reinforcement-Learning":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159636,2,"2026-04-17T23:33:34",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 
handles it all efficiently through natural-language instructions.\n\nIt especially suits software engineers, DevOps practitioners, and technical researchers. Core highlights include a context window of up to one million tokens with strong logical reasoning; built-in utilities such as Google Search, file operations, and shell command execution; and, most distinctively, support for MCP (Model Context Protocol), which lets users flexibly extend custom integrations and connect external capabilities such as image generation. On top of that, a personal Google account comes with a free usage quota, and the project is fully open source under the Apache 2.0 license, making it an ideal assistant for boosting terminal productivity.",100752,"2026-04-10T01:20:03",[52,13,15,14],"Plugin",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown is a lightweight Python utility from Microsoft's AutoGen team, designed to convert all kinds of files into Markdown efficiently. It can parse PDF, Word, Excel, PowerPoint, images (with OCR), audio (with speech transcription), HTML, and even YouTube links, accurately extracting key structural information such as headings, lists, tables, and links.\n\nAs AI applications spread, large language models (LLMs) excel at handling text but cannot directly read complex binary office documents. MarkItDown fills exactly this gap: it turns unstructured or semi-structured files into Markdown, a format models understand 'natively' and that is highly token-efficient, making it an ideal bridge between local files and AI analysis pipelines. It also provides an MCP (Model Context Protocol) server that integrates seamlessly with LLM applications such as Claude Desktop.\n\nThe tool particularly suits developers, data scientists, and AI researchers, especially anyone building retrieval-augmented generation (RAG) systems, running batch text analysis, or wanting an AI assistant to 'read' local files directly. While its output is reasonably human-readable, its core strength lies in producing text for machines to consume.",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":78,"owner_email":77,"owner_twitter":77,"owner_website":79,"owner_url":80,"languages":81,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":32,"env_os":94,"env_gpu":95,"env_ram":94,"env_deps":96,"category_tags":102,"github_topics":103,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":116,"updated_at":117,"faqs":118,"releases":119},8781,"andri27-ts\u002FReinforcement-Learning","Reinforcement-Learning","Learn Deep Reinforcement Learning in 60 days! Lectures & Code in Python. Reinforcement Learning + Deep Learning","Reinforcement-Learning is an open-source tutorial project built for developers and research enthusiasts who want to systematically master deep reinforcement learning (Deep RL) in 60 days. It combines neural networks with reinforcement-learning theory, aiming to help users understand and reproduce the core techniques behind landmark results such as AlphaGo Zero and OpenAI's Dota 2 work.\n\nThe project mainly addresses the beginner's pain point that advanced RL algorithms are 'hard to understand in theory and hard to write in code'. Through a structured week-by-week curriculum, it guides users from basic Q-learning up through Deep Q-Networks (DQN), A2C, PPO, and other advanced policy-gradient algorithms. All examples are written in Python and PyTorch and tested in OpenAI Gym's RoboSchool and Atari environments, keeping theory and practice tightly coupled.\n\nThe material is curated from public lectures by DeepMind and UC Berkeley, providing clear algorithm implementations along with a detailed learning-path index. The author has also published a companion book for deeper study. Engineers with some Python and deep-learning background, as well as researchers who want to apply general-purpose intelligence techniques to real-world problems, will find valuable learning material here. Take on the project's learning challenge and you will come away with a complete technical stack, ready to confidently start building your own agents.","","\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_7a51165c63b3.png)\n\n## Course in Deep Reinforcement Learning\n\n### Explore the combination of neural networks and reinforcement learning. Algorithms and examples in Python & PyTorch\n\n\nHave you heard about the amazing results achieved by [DeepMind with AlphaGo Zero](https:\u002F\u002Fwww.youtube.com\u002Fwatch?time_continue=24&v=tXlM99xPQC8) and by [OpenAI in Dota 2](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=l92J1UvHf6M)? It's all about deep neural networks and reinforcement learning. 
Do you want to know more about it?  \nThis is the right opportunity for you to finally learn Deep RL and use it on new and exciting projects and applications.  \n\nHere you'll find an in-depth introduction to these algorithms: among them Q-learning, deep Q-learning, PPO, and actor-critic, which you'll implement using Python and PyTorch.\n\n> The ultimate aim is to use these general-purpose technologies and apply them to all sorts of important real-world problems.\n> **Demis Hassabis**\n\n\nThis repository contains:  \n\n\u003Cbr>\n\n\u003Cimg align=\"left\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_b54f7a639b5a.png\" alt=\"drawing\" width=\"64\"\u002F> **Lectures (& other content) primarily from the DeepMind and Berkeley YouTube channels.**\n\n\u003Cbr>\n\n\u003Cimg align=\"left\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_438c17272c5f.png\" alt=\"drawing\" width=\"64\"\u002F> **Algorithms (like DQN, A2C, and PPO) implemented in PyTorch and tested on OpenAI Gym: RoboSchool & Atari.**\n\n\n\u003Cbr>\n\u003Cbr>\n\n**Stay tuned and follow me on** [![Twitter Follow](https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Ffollow\u002Fespadrine.svg?style=social&label=Follow)](https:\u002F\u002Ftwitter.com\u002Fandri27_it) and [![GitHub followers](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Ffollowers\u002Fespadrine.svg?style=social&label=Follow)](https:\u002F\u002Fgithub.com\u002Fandri27-ts)   **#60DaysRLChallenge**\n\nWe now also have a [**Slack channel**](https:\u002F\u002F60daysrlchallenge.slack.com\u002F). To get an invitation, email me at andrea.lonza@gmail.com. Also, email me if you have any ideas, suggestions, or improvements.  \n\nTo learn Deep Learning, Computer Vision, or Natural Language Processing, check out my **[1-Year-ML-Journey](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F1-Year-ML-Journey)**\n\n\n### Before starting... Prerequisites\n* Basic level of Python and PyTorch\n* [Machine Learning](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F1-Year-ML-Journey)\n* [Basic knowledge in Deep Learning (MLP, CNN, and RNN)](https:\u002F\u002Fassoc-redirect.amazon.com\u002Fg\u002Fr\u002Fhttps:\u002F\u002Famzn.to\u002F2N3AIlp?tag=andreaaffilia-20)\n\n\u003Cbr>\n\u003Cbr>\n\n\n## Quick Note: my NEW BOOK is out!\nTo learn Reinforcement Learning and Deep RL in more depth, check out my book [**Reinforcement Learning Algorithms with Python**](https:\u002F\u002Fwww.amazon.com\u002FReinforcement-Learning-Algorithms-Python-understand\u002Fdp\u002F1789131111)!!\n\n\u003Ca href=\"https:\u002F\u002Fwww.amazon.com\u002FReinforcement-Learning-Algorithms-Python-understand\u002Fdp\u002F1789131111\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_c6c37de1f752.jpg\" alt=\"drawing\" width=\"350\" align=\"right\"\u002F>\n\u003C\u002Fa>\n\n**Table of Contents**\n1. The Landscape of Reinforcement Learning\n2. Implementing RL Cycle and OpenAI Gym\n3. Solving Problems with Dynamic Programming\n4. Q-learning and SARSA Applications\n5. Deep Q-Network\n6. Learning Stochastic and DDPG optimization\n7. TRPO and PPO implementation\n8. DDPG and TD3 Applications\n9. Model-Based RL\n10. Imitation Learning with the DAgger Algorithm\n11. Understanding Black-Box Optimization Algorithms\n12. Developing the ESBAS Algorithm\n13. 
Practical Implementation for Resolving RL Challenges\n\n\n\u003Cbr>\n\u003Cbr>\n\u003Cbr>\n\n\n## Index - Reinforcement Learning\n \n - [Week 1 - **Introduction**](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F60_Days_RL_Challenge#week-1---introduction)\n - [Week 2 - **RL Basics**](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F60_Days_RL_Challenge#week-2---rl-basics-mdp-dynamic-programming-and-model-free-control)\n - [Week 3 - **Value based algorithms - DQN**](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F60_Days_RL_Challenge#week-3---value-function-approximation-and-dqn)\n - [Week 4 - **Policy gradient algorithms - REINFORCE & A2C**](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F60_Days_RL_Challenge#week-4---policy-gradient-methods-and-a2c)\n - [Week 5 - **Advanced Policy Gradients - PPO**](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F60_Days_RL_Challenge#week-5---advanced-policy-gradients---trpo--ppo)\n - [Week 6 - **Evolution Strategies and Genetic Algorithms - ES**](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F60_Days_RL_Challenge#week-6---evolution-strategies-and-genetic-algorithms)\n - [Week 7 - **Model-Based reinforcement learning - MB-MF**](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F60_Days_RL_Challenge#week-7---model-based-reinforcement-learning)\n - [Week 8 - **Advanced Concepts and Project Of Your Choice**](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F60_Days_RL_Challenge\u002Fblob\u002Fmaster\u002FREADME.md#week-8---advanced-concepts-and-project-of-your-choice)\n - [Last 4 days - **Review + Sharing**](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F60_Days_RL_Challenge\u002Fblob\u002Fmaster\u002FREADME.md#last-4-days---review--sharing)\n - [Best resources](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F60_Days_RL_Challenge#best-resources)\n - [Additional resources](https:\u002F\u002Fgithub.com\u002Fandri27-ts\u002F60_Days_RL_Challenge#additional-resources)\n\u003Cbr>\n\n## Week 1 - Introduction\n\n- **[Why is Reinforcement Learning such an important learning method - A simple explanation](https:\u002F\u002Fmedium.com\u002F@andrea.lonzats\u002Fthe-learning-machines-fb922e539335)**\n- **[Introduction and course overview](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Q4kF8sfggoI&index=1&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3) - CS294 by Levine, Berkeley**\n- **[Deep Reinforcement Learning: Pong from Pixels](http:\u002F\u002Fkarpathy.github.io\u002F2016\u002F05\u002F31\u002Frl\u002F) by Karpathy**\n\n##\n\n#### Other Resources\n\n- :books: [The \"Bible\" of Reinforcement Learning: Chapter 1](https:\u002F\u002Fassoc-redirect.amazon.com\u002Fg\u002Fr\u002Fhttps:\u002F\u002Famzn.to\u002F2HRSSmh?tag=andreaaffilia-20) - Sutton & Barto\n- Great introductory paper: [Deep Reinforcement Learning: An Overview](https:\u002F\u002Fwww.groundai.com\u002Fproject\u002Fdeep-reinforcement-learning-an-overview\u002F)\n- Start coding: [From Scratch: AI Balancing Act in 50 Lines of Python](https:\u002F\u002Ftowardsdatascience.com\u002Ffrom-scratch-ai-balancing-act-in-50-lines-of-python-7ea67ef717)\n\n\u003Cbr>\n\n## Week 2 - RL Basics: *MDP, Dynamic Programming and Model-Free Control*\n\n> Those who cannot remember the past are condemned to repeat it - **George Santayana**\n\n\nThis week, we will learn about the basic building blocks of reinforcement learning, starting from the definition of the problem all the way through the estimation and optimization of the functions that are used to express the quality of a policy or state.\n\n##\n\n### Lectures - Theory 
\u003Cimg align=\"right\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_b54f7a639b5a.png\" alt=\"drawing\" width=\"48\"\u002F> \n\n\n* **[Markov Decision Process](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=lfHX2hHRMVQ&list=PLzuuYNsE1EZAXYR4FJ75jcJseBmo4KQ9-&index=2) - David Silver (DeepMind)**\n  * Markov Processes\n  * Markov Decision Processes\n\n- **[Planning by Dynamic Programming](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Nd1-UUMVfz4&list=PLzuuYNsE1EZAXYR4FJ75jcJseBmo4KQ9-&index=3) - David Silver (DeepMind)**\n  * Policy iteration\n  * Value iteration\n\n* **[Model-Free Prediction](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=PnHCvfgC_ZA&index=4&list=PLzuuYNsE1EZAXYR4FJ75jcJseBmo4KQ9-) - David Silver (DeepMind)**\n  * Monte Carlo Learning\n  * Temporal Difference Learning\n  * TD(λ)\n\n- **[Model-Free Control](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=0g4j2k_Ggc4&list=PLzuuYNsE1EZAXYR4FJ75jcJseBmo4KQ9-&index=5) - David Silver (DeepMind)**\n  * Ɛ-greedy policy iteration\n  * GLIE Monte Carlo Search\n  * SARSA\n  * Importance Sampling\n\n##\n\n### Project of the Week - [**Q-learning**](Week2\u002Ffrozenlake_Qlearning.ipynb) \u003Cimg align=\"right\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_438c17272c5f.png\" alt=\"drawing\" width=\"48\"\u002F>\n\n[**Q-learning applied to FrozenLake**](Week2\u002Ffrozenlake_Qlearning.ipynb) - As an exercise, you can solve the game using SARSA or implement Q-learning by yourself. In the former case, only a few changes are needed.
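\n\nFor reference, the tabular update rule at the heart of this exercise fits in a few lines (a generic sketch, not code from the notebook; `alpha` and `gamma` are assumed hyperparameters):\n\n```python\nimport numpy as np\n\n# One Q-learning step: move Q(s, a) toward the bootstrapped target\n# r + gamma * max over a' of Q(s', a').\ndef q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):\n    target = reward + gamma * np.max(q_table[next_state])\n    q_table[state, action] += alpha * (target - q_table[state, action])\n```\n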
\n##\n\n#### Other Resources\n- :books: [The \"Bible\" of Reinforcement Learning: Chapters 3 and 4](https:\u002F\u002Fassoc-redirect.amazon.com\u002Fg\u002Fr\u002Fhttps:\u002F\u002Famzn.to\u002F2HRSSmh?tag=andreaaffilia-20) - Sutton & Barto\n- :tv: [Value functions introduction](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=k1vNh4rNYec&index=6&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3) - DRL UC Berkeley by Sergey Levine\n\n\u003Cbr>\n\n## Week 3 - Value based algorithms - DQN\n\nThis week we'll learn more advanced concepts and apply deep neural networks to Q-learning algorithms.\n\n##\n\n### Lectures - Theory \u003Cimg align=\"right\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_b54f7a639b5a.png\" alt=\"drawing\" width=\"48\"\u002F> \n\n- **[Value functions approximation](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UoPei5o4fps&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=6) - David Silver (DeepMind)**\n  - Differentiable function approximators\n  - Incremental methods\n  - Batch methods (DQN)\n\n* **[Advanced Q-learning algorithms](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=nZXC5OdDfs4&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&index=7) - Sergey Levine (UC Berkeley)**\n  - Replay Buffer\n  - Double Q-learning\n  - Continuous actions (NAF, DDPG)\n  - Practical tips\n\n##\n\n### Project of the Week - [**DQN and variants**](Week3) \u003Cimg align=\"right\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_438c17272c5f.png\" alt=\"drawing\" width=\"48\"\u002F>\n\n\n\u003Cimg align=\"left\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_51febd650e6d.gif\" alt=\"drawing\" width=\"200\"\u002F> \n\n[**DQN and some variants applied to Pong**](Week3) - This week the goal is to develop a DQN algorithm to play an Atari game. To make it more interesting, I developed four extensions of DQN: **Double Q-learning**, **Multi-step learning**, **Dueling networks** and **Noisy Nets**. Play with them, and if you feel confident, you can implement Prioritized replay or Distributional RL. To learn more about these improvements, read the papers!\n\n\u003Cbr clear=\"left\"\u002F>\n
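\nTo give a flavor of the first extension (a minimal sketch, not the repo's exact code; `q_net` and `target_net` are assumed to be the online and target networks, and `done` a 0\u002F1 float tensor): Double Q-learning selects the next action with the online network but evaluates it with the target network when building the TD target:\n\n```python\nimport torch\n\n# Double DQN target (van Hasselt et al., 2015): decouple action\n# selection (online network) from action evaluation (target network).\ndef double_dqn_target(q_net, target_net, reward, next_obs, done, gamma=0.99):\n    with torch.no_grad():\n        best_action = q_net(next_obs).argmax(dim=1, keepdim=True)\n        next_value = target_net(next_obs).gather(1, best_action).squeeze(1)\n    return reward + gamma * next_value * (1 - done)\n```\n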
\n##\n\n\n#### Papers\n\n##### Must Read\n - [Playing Atari with Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1312.5602.pdf) - 2013\n - [Human-level control through deep reinforcement learning](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fdqn\u002FDQNNaturePaper.pdf) - 2015\n - [Rainbow: Combining Improvements in Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1710.02298.pdf) - 2017\n\n##### Extensions of DQN\n - [Deep Reinforcement Learning with Double Q-learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1509.06461.pdf) - 2015\n - [Prioritized Experience Replay](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1511.05952.pdf) - 2015\n - [Dueling Network Architectures for Deep Reinforcement Learning](http:\u002F\u002Fproceedings.mlr.press\u002Fv48\u002Fwangf16.pdf) - 2016\n - [Noisy networks for exploration](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1706.10295.pdf) - 2017\n - [Distributional Reinforcement Learning with Quantile Regression](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1710.10044.pdf) - 2017\n \n#### Other Resources\n  - :books: [The \"Bible\" of Reinforcement Learning: Chapters 5 and 6](https:\u002F\u002Fassoc-redirect.amazon.com\u002Fg\u002Fr\u002Fhttps:\u002F\u002Famzn.to\u002F2HRSSmh?tag=andreaaffilia-20) - Sutton & Barto\n  - :tv: [Deep Reinforcement Learning in the Enterprise: Bridging the Gap from Games to Industry](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=GOsUHlr4DKE)\n\n\u003Cbr>\n\n## Week 4 - Policy gradient algorithms - REINFORCE & A2C\n\nWeek 4 introduces Policy Gradient methods, a class of algorithms that optimize the policy directly. You'll also learn about Actor-Critic algorithms, which combine the policy gradient (the actor) with a value function (the critic).\n\n##\n\n### Lectures - Theory \u003Cimg align=\"right\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_b54f7a639b5a.png\" alt=\"drawing\" width=\"48\"\u002F> \n\n* **[Policy gradient Methods](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=KHZVXao4qXs&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=7) - David Silver (DeepMind)**\n  - Finite Difference Policy Gradient\n  - Monte-Carlo Policy Gradient\n  - Actor-Critic Policy Gradient\n\n- **[Policy gradient intro](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=XGmd3wcyDg8&t=0s&list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37&index=3) - Sergey Levine (RECAP, optional)**\n  - Policy Gradient (REINFORCE and Vanilla PG)\n  - Variance reduction\n\n* **[Actor-Critic](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Tol_jw5hWnI&list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37&index=4) - Sergey Levine (More in depth)**\n  - Actor-Critic\n  - Discount factor\n  - Actor-Critic algorithm design (batch mode or online)\n  - State-dependent baselines\n\n##\n\n### Project of the Week - [**Vanilla PG and A2C**](Week4) \u003Cimg align=\"right\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_438c17272c5f.png\" alt=\"drawing\" width=\"48\"\u002F>\n\n[**Vanilla PG and A2C applied to CartPole**](Week4) - This week's exercise is to implement a policy gradient method or a more sophisticated actor-critic. In the repository you can find an implemented version of [PG and A2C](Week4). Bug alert! Note that A2C gives me strange results. \nIf you find the implementation of PG and A2C easy, you can try the [asynchronous version of A2C (A3C)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1602.01783.pdf).\n
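\nAs a pointer for the exercise (a minimal sketch under assumed names; `policy` is any module mapping observations to action logits, and `returns` holds the discounted returns G_t): REINFORCE follows the gradient of log pi(a|s) weighted by the return, usually written as a loss like this:\n\n```python\nimport torch\n\n# REINFORCE loss for one episode: -(sum over t of log pi(a_t|s_t) * G_t).\ndef reinforce_loss(policy, observations, actions, returns):\n    logits = policy(observations)\n    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)\n    return -(log_probs * returns).sum()\n```\n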
\n##\n\n#### Papers\n\n- [Policy Gradient methods for reinforcement learning with function approximation](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf)\n- [Asynchronous Methods for Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1602.01783.pdf)\n\n#### Other Resources\n  - :books: [The \"Bible\" of Reinforcement Learning: Chapters 9 and 10](https:\u002F\u002Fassoc-redirect.amazon.com\u002Fg\u002Fr\u002Fhttps:\u002F\u002Famzn.to\u002F2HRSSmh?tag=andreaaffilia-20) - Sutton & Barto\n  - :books: [Intuitive RL: Intro to Advantage-Actor-Critic (A2C)](https:\u002F\u002Fhackernoon.com\u002Fintuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752)\n  - :books: [Asynchronous Actor-Critic Agents (A3C)](https:\u002F\u002Fmedium.com\u002Femergent-future\u002Fsimple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2)\n\n\u003Cbr>\n\n## Week 5 - Advanced Policy Gradients - PPO\n\nThis week is about advanced policy gradient methods that improve the stability and convergence of \"Vanilla\" policy gradient methods. You'll learn and implement PPO, an RL algorithm developed by OpenAI and adopted in [OpenAI Five](https:\u002F\u002Fblog.openai.com\u002Fopenai-five\u002F).\n\n##\n\n### Lectures - Theory \u003Cimg align=\"right\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_b54f7a639b5a.png\" alt=\"drawing\" width=\"48\"\u002F> \n\n- **[Advanced policy gradients](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=ycCtmp4hcUs&t=0s&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&index=15) - Sergey Levine (UC Berkeley)**\n  - Problems with \"Vanilla\" Policy Gradient Methods\n  - Policy Performance Bounds\n  - Monotonic Improvement Theory\n  - Algorithms: NPO, TRPO, PPO\n\n* **[Natural Policy Gradients, TRPO, PPO](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=xvRrgxcpaHY) - John Schulman (Berkeley DRL Bootcamp)** - (RECAP, optional)\n  * Limitations of \"Vanilla\" Policy Gradient Methods\n  * Natural Policy Gradient\n  * Trust Region Policy Optimization, TRPO\n  * Proximal Policy Optimization, PPO\n\n##\n\n### Project of the Week - [**PPO**](Week5) \u003Cimg align=\"right\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_438c17272c5f.png\" alt=\"drawing\" width=\"48\"\u002F>\n\n\u003Cimg align=\"left\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_a69e59813d28.gif\" alt=\"drawing\" width=\"300\"\u002F> \n\n[**PPO applied to BipedalWalker**](Week5) - This week, you have to implement PPO or TRPO. I suggest PPO, given its simplicity compared to TRPO. In the project folder Week5 you'll find an implementation of [**PPO that learns to play BipedalWalker**](Week5).\nThe folder also contains other resources that will help you develop the project. Have fun!\n\n\u003Cbr clear=\"left\"\u002F>\n\nTo learn more about PPO, read the [paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1707.06347.pdf) and take a look at the [Arxiv Insights video](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=5P7I-xPq8u8).\n
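\nThe heart of what you'll implement is the clipped surrogate objective from the paper (a sketch, not the Week5 code; `advantages` are assumed to be precomputed, e.g. with GAE):\n\n```python\nimport torch\n\n# PPO clipped loss: -E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)],\n# where r_t = pi_new(a|s) \u002F pi_old(a|s), computed from log-probabilities.\ndef ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):\n    ratio = torch.exp(new_log_probs - old_log_probs)\n    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages\n    return -torch.min(ratio * advantages, clipped).mean()\n```\n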
\n##\n\n#### Papers\n\n- [Trust Region Policy Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1502.05477.pdf) - 2015\n- [Proximal Policy Optimization Algorithms](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1707.06347.pdf) - 2017\n\n#### Other Resources\n  - :books: To better understand PPO and TRPO: [The Pursuit of (Robotic) Happiness](https:\u002F\u002Ftowardsdatascience.com\u002Fthe-pursuit-of-robotic-happiness-how-trpo-and-ppo-stabilize-policy-gradient-methods-545784094e3b)\n  - :tv: [Nuts and Bolts of Deep RL](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=8EcdaCk9KaQ&)\n  - :books: PPO best practice: [Training with Proximal Policy Optimization](https:\u002F\u002Fgithub.com\u002FUnity-Technologies\u002Fml-agents\u002Fblob\u002Fmaster\u002Fdocs\u002FTraining-PPO.md)\n  - :tv: [Explanation of the PPO algorithm by Arxiv Insights](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=5P7I-xPq8u8)\n\n\u003Cbr>\n\n## Week 6 - Evolution Strategies and Genetic Algorithms - ES\n\nIn recent years, Evolution Strategies (ES) and Genetic Algorithms (GA) have been shown to achieve results comparable to RL methods. They are derivative-free black-box algorithms that need more data than RL to learn, but are able to scale up across thousands of CPUs. This week we'll look at these black-box algorithms.\n\n##\n\n### Lectures & Articles - Theory \u003Cimg align=\"right\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_b54f7a639b5a.png\" alt=\"drawing\" width=\"48\"\u002F> \n\n- **Evolution Strategies**\n  - [Intro to ES: A Visual Guide to Evolution Strategies](http:\u002F\u002Fblog.otoro.net\u002F2017\u002F10\u002F29\u002Fvisual-evolution-strategies\u002F)\n  - [ES for RL: Evolving Stable Strategies](http:\u002F\u002Fblog.otoro.net\u002F2017\u002F11\u002F12\u002Fevolving-stable-strategies\u002F)\n  - [Derivative-free Methods - Lecture](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=SQtOI9jsrJ0&feature=youtu.be)\n  - [Evolution Strategies (paper discussion)](https:\u002F\u002Fblog.openai.com\u002Fevolution-strategies\u002F)\n- **Genetic Algorithms**\n  - [Introduction to Genetic Algorithms — Including Example Code](https:\u002F\u002Ftowardsdatascience.com\u002Fintroduction-to-genetic-algorithms-including-example-code-e396e98d8bf3)\n\n##\n\n### Project of the Week - [**ES**](Week6) \u003Cimg align=\"right\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_438c17272c5f.png\" alt=\"drawing\" width=\"48\"\u002F>\n\n\u003Cimg align=\"left\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_64d73a1594ce.gif\" alt=\"drawing\" width=\"300\"\u002F> \n\n[**Evolution Strategies applied to LunarLander**](Week6) - This week's project is to implement an ES or GA.\nIn the [**Week6 folder**](Week6) you can find a basic implementation of the paper [Evolution Strategies as a Scalable Alternative to Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1703.03864.pdf) that solves LunarLanderContinuous. You can modify it to play more difficult environments or add your own ideas.\n\n\u003Cbr clear=\"left\"\u002F>\n
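\nThe whole method fits in surprisingly little code (a sketch of the paper's update, not the Week6 code; `evaluate` is an assumed function returning the total episode reward for a given parameter vector):\n\n```python\nimport numpy as np\n\n# One ES step: perturb the parameters with Gaussian noise, weight each\n# perturbation by its (normalized) reward, and step along the weighted mean.\ndef es_step(theta, evaluate, pop_size=50, sigma=0.1, lr=0.01):\n    noise = np.random.randn(pop_size, theta.size)\n    rewards = np.array([evaluate(theta + sigma * n) for n in noise])\n    rewards = (rewards - rewards.mean()) \u002F (rewards.std() + 1e-8)\n    return theta + lr \u002F (pop_size * sigma) * (noise.T @ rewards)\n```\n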
\n##\n\n#### Papers\n\n - [Deep Neuroevolution: Genetic Algorithms are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1712.06567.pdf)\n - [Evolution Strategies as a Scalable Alternative to Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1703.03864.pdf)\n \n#### Other Resources\n  - :books: [Evolutionary Optimization Algorithms](https:\u002F\u002Fassoc-redirect.amazon.com\u002Fg\u002Fr\u002Fhttps:\u002F\u002Famzn.to\u002F34EphXc?tag=andreaaffilia-20) - Dan Simon\n  \n\u003Cbr>\n\n## Week 7 - Model-Based reinforcement learning - MB-MF\n\nThe algorithms studied up to now are model-free, meaning that they only choose the best action given a state. These algorithms achieve very good performance but require a lot of training data. Model-based algorithms, instead, learn a model of the environment and plan the next actions according to the learned model. These methods are more sample-efficient than model-free ones, but overall achieve worse performance. This week you'll learn the theory behind these methods and implement one of the latest algorithms.\n\n##\n\n### Lectures - Theory \u003Cimg align=\"right\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_b54f7a639b5a.png\" alt=\"drawing\" width=\"48\"\u002F> \n\n- **Model-Based RL, David Silver (DeepMind) (concise version)**\n  - [Integrating Learning and Planning](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=ItMutbeOHtc&index=8&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ)\n    - Model-Based RL Overview\n    - Integrated architectures\n    - Simulation-Based search\n- **Model-Based RL, Sergey Levine (UC Berkeley) (in-depth version)**\n  - [Learning dynamical systems from data](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=yap_g0d7iBQ&index=9&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3)\n    - Overview of model-based RL\n    - Global and local models\n    - Learning with local models and trust regions\n  - [Learning policies by imitating optimal controllers](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=AwdauFLan7M&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&index=10)\n    - Backpropagation into a policy with learned models\n    - Guided policy search algorithm\n    - Imitating optimal control with DAgger\n  - [Advanced model learning and images](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=vRkIwM4GktE&index=11&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3)\n    - Models in latent space\n    - Models directly in image space\n    - Inverse models\n\n\n##\n\n### Project of the Week - [**MB-MF**](Week7) \u003Cimg align=\"right\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_438c17272c5f.png\" alt=\"drawing\" width=\"48\"\u002F>\n\n\u003Cimg align=\"left\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_ccc5c8c3b4f0.gif\" alt=\"drawing\" width=\"300\"\u002F> \n\n[**MB-MF applied to RoboschoolAnt**](Week7) - This week I chose to implement the model-based algorithm described in this [paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1708.02596.pdf).\nYou can find my implementation [here](Week7).\nNB: Instead of implementing it on Mujoco as in the paper, I used [RoboSchool](https:\u002F\u002Fgithub.com\u002Fopenai\u002Froboschool), an open-source robot simulator integrated with OpenAI Gym.\n\n\u003Cbr clear=\"left\"\u002F>\n
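\nThe planning loop of that paper boils down to random-shooting MPC (a sketch under assumed names; `dynamics` is the learned next-state model and `reward_fn` a known reward function): sample random action sequences, roll them out through the model, and execute the first action of the best sequence:\n\n```python\nimport numpy as np\n\n# Random-shooting MPC: evaluate n_seq random action sequences over a short\n# horizon with the learned model and return the best first action.\ndef mpc_action(state, dynamics, reward_fn, action_dim, n_seq=1000, horizon=10):\n    sequences = np.random.uniform(-1, 1, size=(n_seq, horizon, action_dim))\n    total_rewards = np.zeros(n_seq)\n    for i, seq in enumerate(sequences):\n        s = state\n        for a in seq:\n            next_s = dynamics(s, a)  # learned model prediction\n            total_rewards[i] += reward_fn(s, a, next_s)\n            s = next_s\n    return sequences[np.argmax(total_rewards), 0]\n```\n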
alt=\"drawing\" width=\"48\"\u002F> \n\n- Sergey Levine (Berkley)\n  - [Connection between inference and control](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=iOYiPhu5GEk&index=13&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&t=0s)\n  - [Inverse reinforcement learning](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=-3BcZwgmZLk&index=14&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&t=0s)\n  - [Exploration (part 1)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=npi6B4VQ-7s&index=16&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&t=0s)\n  - [Exploration (part 2) and transfer learning](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=0WbVUvKJpg4&index=17&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&t=0s)\n  - [Multi-task learning and transfer](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UqSx23W9RYE&index=18&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&t=0s)\n  - [Meta-learning and parallelism](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Xe9bktyYB34&index=18&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3)\n  - [Advanced imitation learning and open problems](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=mc-DtbhhiKA&index=20&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&t=0s)\n- David Silver (DeepMind)\n  - [Classic Games](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=N1LKLc6ufGY&feature=youtu.be)\n\n\n##\n\n### The final project\nHere you can find some project ideas.\n - [Pommerman](https:\u002F\u002Fwww.pommerman.com\u002F) (Multiplayer)\n - [AI for Prosthetics Challenge](https:\u002F\u002Fwww.crowdai.org\u002Fchallenges\u002Fnips-2018-ai-for-prosthetics-challenge) (Challenge)\n - [Word Models](https:\u002F\u002Fworldmodels.github.io\u002F) (Paper implementation)\n - [Request for research OpenAI](https:\u002F\u002Fblog.openai.com\u002Frequests-for-research-2\u002F) (Research)\n - [Retro Contest](https:\u002F\u002Fblog.openai.com\u002Fretro-contest\u002F) (Transfer learning)\n\n##\n\n#### Other Resources\n* AlphaGo Zero\n  - [Paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fnature24270.epdf?author_access_token=VJXbVjaSHxFoctQQ4p2k4tRgN0jAjWel9jnR3ZoTv0PVW4gB86EEpGqTRDtpIz-2rmo8-KG06gqVobU5NSCFeHILHcVFUeMsbvwS-lxjqQGg98faovwjxeTUgZAUMnRQ)\n  - DeepMind blog post: [AlphaGo Zero: Learning from scratch](https:\u002F\u002Fdeepmind.com\u002Fblog\u002Falphago-zero-learning-scratch\u002F)\n  - Arxiv Insights video: [How AlphaGo Zero works - Google DeepMind](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=MgowR4pq3e8)\n* OpenAI Five\n  - OpenAI blog post: [OpenAI Five](https:\u002F\u002Fblog.openai.com\u002Fopenai-five\u002F)\n  - Arxiv Insights video: [OpenAI Five: Facing Human Pro's in Dota II](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=0eO2TSVVP1Y)\n\n\u003Cbr>\n\n## Last 4 days - Review + Sharing\n\nCongratulation for completing the 60 Days RL Challenge!! Let me know if you enjoyed it and share it!\n\nSee you!\n\n## Best resources\n\n:books: [Reinforcement Learning: An Introduction](https:\u002F\u002Fassoc-redirect.amazon.com\u002Fg\u002Fr\u002Fhttps:\u002F\u002Famzn.to\u002F2HRSSmh?tag=andreaaffilia-20) - by Sutton & Barto. The \"Bible\" of reinforcement learning. 
[Here](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1opPSz5AZ_kVa1uWOdOiveNiBFiEOHjkG\u002Fview) you can find the PDF draft of the second edition.\n\n:books: [Deep Reinforcement Learning Hands-On](https:\u002F\u002Fassoc-redirect.amazon.com\u002Fg\u002Fr\u002Fhttps:\u002F\u002Famzn.to\u002F2PRxKD7?tag=andreaaffilia-20) - by Maxim Lapan\n\n:books: [Deep Learning](https:\u002F\u002Fassoc-redirect.amazon.com\u002Fg\u002Fr\u002Fhttps:\u002F\u002Famzn.to\u002F2N3AIlp?tag=andreaaffilia-20) - Ian Goodfellow\n\n:tv: [Deep Reinforcement Learning](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3) - UC Berkeley class by Levine; check out their course site [here](http:\u002F\u002Frail.eecs.berkeley.edu\u002Fdeeprlcourse\u002F).\n\n:tv: [Reinforcement Learning course](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ) - by David Silver, DeepMind. Great introductory lectures by Silver, a lead researcher on AlphaGo. They follow the book Reinforcement Learning: An Introduction by Sutton & Barto.\n\n\n\n## Additional resources\n\n:books: [Awesome Reinforcement Learning](https:\u002F\u002Fgithub.com\u002Faikorea\u002Fawesome-rl). A curated list of resources dedicated to reinforcement learning.\n\n:books: [GroundAI on RL](https:\u002F\u002Fwww.groundai.com\u002F?text=reinforcement+learning). Papers on reinforcement learning.\n\n\n## A Cup of Coffee :coffee:\n\nAny contribution is highly appreciated! Cheers!\n\n[![paypal](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_readme_3ed0eaf21fb2.gif)](https:\u002F\u002Fwww.paypal.com\u002Fcgi-bin\u002Fwebscr?cmd=_s-xclick&hosted_button_id=NKSNP93CNY4KN)\n",""
,"# Reinforcement-Learning Quick Start Guide\n\nThis guide is based on the `Reinforcement-Learning` open-source project and is meant to help developers quickly get hands-on with the core algorithms and practical applications of deep reinforcement learning (Deep RL). The project combines neural networks with reinforcement learning, implements mainstream algorithms such as DQN, A2C, and PPO in Python and PyTorch, and tests them in OpenAI 
\n\n## Basic Usage\n\nThe project organizes the material week by week; each week pairs lecture video links with a matching code implementation. The examples below use the **Week 2 Q-learning exercise** and the **Week 3 DQN exercise** to show how to work with the code.\n\n### Example 1: Solve FrozenLake with Q-learning (Week 2)\n\nEnter the Week2 directory and run the Jupyter notebook (or the corresponding Python script) to try basic tabular reinforcement learning.\n\n```bash\ncd Week2\n# If Jupyter is available in your environment\njupyter notebook frozenlake_Qlearning.ipynb\n```\n\nThe notebook walks through initializing the environment, defining the Q-table, running the training loop, and evaluating the policy. The core logic looks like this (the notebook targets the legacy `gym` API; the listing below is a runnable sketch of the same logic against `gymnasium`):\n\n```python\nimport gymnasium as gym\nimport numpy as np\n\n# Hyperparameters\nalpha = 0.1            # learning rate\ngamma = 0.99           # discount factor\nepsilon = 0.1          # exploration rate\ntotal_episodes = 10000\n\n# Initialize the environment\nenv = gym.make('FrozenLake-v1')\n\n# Initialize the Q-table\nq_table = np.zeros([env.observation_space.n, env.action_space.n])\n\ndef choose_action(state, q_table, epsilon):\n    # epsilon-greedy: explore with probability epsilon, otherwise exploit\n    if np.random.rand() < epsilon:\n        return env.action_space.sample()\n    return int(np.argmax(q_table[state]))\n\n# Training loop\nfor episode in range(total_episodes):\n    state, _ = env.reset()  # gymnasium reset() returns (obs, info)\n    done = False\n    while not done:\n        action = choose_action(state, q_table, epsilon)\n        # gymnasium step() returns (obs, reward, terminated, truncated, info)\n        next_state, reward, terminated, truncated, _ = env.step(action)\n        done = terminated or truncated\n\n        # Q-learning update rule\n        old_value = q_table[state, action]\n        next_max = np.max(q_table[next_state])\n        q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)\n\n        state = next_state\n```\n\n### Example 2: Play Pong with DQN (Week 3)\n\nEnter the Week3 directory, which shows how to approximate the value function with a deep neural network. The project provides several variants (Double DQN, Dueling Networks, etc.).\n\n```bash\ncd Week3\n# Run the main training script (check the directory for the exact file name, typically main.py or dqn.py)\npython main.py --game PongNoFrameskip-v4 --algorithm DQN\n```\n\nTo try the more advanced variants, change the arguments:\n\n```bash\n# Run Double DQN\npython main.py --game PongNoFrameskip-v4 --algorithm DoubleDQN\n\n# Run Dueling Networks\npython main.py --game PongNoFrameskip-v4 --algorithm DuelingDQN\n```\n\n**Code layout:**\n*   `model.py`: defines the network architecture (e.g. a CNN for image input).\n*   `replay_buffer.py`: implements the experience replay mechanism.\n*   `agent.py`: wraps the agent's training step (loss computation, backpropagation, target-network updates); see the sketch right after this list.
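\n\nTo make that division of labor concrete, here is a minimal, self-contained sketch of the two pieces at the heart of any DQN implementation: a replay buffer and the TD-target loss computed against a separate target network. The names (`ReplayBuffer`, `dqn_loss`) and tensor shapes are illustrative assumptions, not the project's actual code:\n\n```python\nimport random\nfrom collections import deque\n\nimport torch\nimport torch.nn as nn\n\nclass ReplayBuffer:\n    # Fixed-size buffer that stores (state, action, reward, next_state, done) transitions\n    def __init__(self, capacity=100_000):\n        self.buffer = deque(maxlen=capacity)\n\n    def push(self, state, action, reward, next_state, done):\n        self.buffer.append((state, action, reward, next_state, done))\n\n    def sample(self, batch_size):\n        # Uniform random sampling breaks the temporal correlation between consecutive frames\n        batch = random.sample(self.buffer, batch_size)\n        states, actions, rewards, next_states, dones = zip(*batch)\n        return (torch.stack(states),\n                torch.tensor(actions, dtype=torch.int64),\n                torch.tensor(rewards, dtype=torch.float32),\n                torch.stack(next_states),\n                torch.tensor(dones, dtype=torch.float32))\n\n    def __len__(self):\n        return len(self.buffer)\n\ndef dqn_loss(online_net, target_net, batch, gamma=0.99):\n    # Regress Q(s, a) toward the TD target r + gamma * max_a' Q_target(s', a')\n    states, actions, rewards, next_states, dones = batch\n    # Q-values of the actions that were actually taken\n    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)\n    with torch.no_grad():\n        # Bootstrap from the periodically synced target network\n        next_q = target_net(next_states).max(dim=1).values\n        targets = rewards + gamma * next_q * (1.0 - dones)\n    return nn.functional.mse_loss(q_values, targets)\n```\n\nA typical training step pushes the latest transition into the buffer, samples a random mini-batch, backpropagates `dqn_loss` through the online network, and every few thousand steps copies the online weights into the target network to keep the bootstrapped targets stable.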
\n\n### Where to Go Next\n\nAfter the basic examples, follow the project's Index to go deeper:\n*   **Week 4**: policy gradient methods (REINFORCE, A2C).\n*   **Week 5**: advanced policy gradient algorithms (PPO, TRPO).\n*   **Week 6-8**: evolution strategies, model-based RL, and the capstone projects.\n\nAll algorithms are implemented in PyTorch, so you can edit the network architecture or hyperparameters in `agent.py` directly and watch how training is affected.","An algorithms engineer at an autonomous-driving startup is working on a driving agent that must make its own decisions at complex urban intersections, coping with shifting traffic flow and sudden events.\n\n### Without Reinforcement-Learning\n- **Rule writing hits a dead end**: the engineer tries to cover every road situation with hard-coded if-else rules, but long-tail scenarios such as aggressive cut-ins and pedestrians darting out make the rule base bloated and hard to maintain, and the system breaks easily.\n- **No capacity for sequential decision-making**: traditional supervised learning can only imitate historical data; it cannot let the vehicle reason over multiple steps or optimize long-term return in unseen dynamic environments, so the car hesitates at intersections.\n- **Trial and error is prohibitively expensive**: testing immature policies on real roads carries huge risk, while the existing simulator lacks an efficient self-learning loop; tuning parameters by hand is like fishing a needle out of the sea, and the development cycle drags on indefinitely.\n- **A high barrier from theory to practice**: the team knows about frontier algorithms such as PPO and DQN but lacks a systematic bridge from the DeepMind and Berkeley lectures to PyTorch implementations, so reproducing top papers remains out of reach.\n\n### With Reinforcement-Learning\n- **Fast algorithm reproduction and validation**: using the mature PyTorch implementations of DQN, PPO, and Actor-Critic provided by the project, the engineers deploy a deep-network decision model directly in OpenAI Gym simulation, with no need to build everything from scratch.\n- **End-to-end adaptive learning**: through the structured 60-day curriculum, the agent learns to balance safety and efficiency in dynamic interactions and handles complex scenarios it has never seen, no longer depending on rigid rules.\n- **Cheap, high-frequency iteration**: with benchmarks such as RoboSchool and Atari, the team runs millions of trial-and-error episodes in simulation, compressing what used to be months of real-vehicle validation into days.\n- **Theory and practice seamlessly connected**: combining the accompanying lecture videos with the code examples, team members quickly absorb the core ideas from Markov decision processes to black-box optimization, markedly improving tuning efficiency.\n\nReinforcement-Learning turns driving rules that used to be piled up from human experience into a decision system capable of self-improvement, sharply lowering the development barrier and time cost of advanced autonomous-driving algorithms.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fandri27-ts_Reinforcement-Learning_7a51165c.png","andri27-ts","Andrea","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fandri27-ts_33ff3a86.jpg","Machine Learning projects",null,"Italy","https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fandrea-lonza\u002F","https:\u002F\u002Fgithub.com\u002Fandri27-ts",[82,86],{"name":83,"color":84,"percentage":85},"Jupyter Notebook","#DA5B0B",53.6,{"name":87,"color":88,"percentage":89},"Python","#3572A5",46.4,4712,670,"2026-04-17T09:40:18","MIT","Not specified","Not specified (the project is built on PyTorch and tested on OpenAI Gym\u002FAtari; an NVIDIA GPU with CUDA support is generally recommended to speed up training, but the README does not state a specific model or VRAM requirement)",{"notes":97,"python":98,"dependencies":99},"This project is a deep reinforcement learning course and code repository. Prerequisites are basic Python and PyTorch skills plus machine learning fundamentals (MLP, CNN, RNN). The code is tested mainly in OpenAI Gym environments (such as RoboSchool and Atari games). Exact version requirements (e.g. PyTorch or Python versions) are not listed in the README excerpt provided; consult the code files in the project or a requirements.txt if one exists.","Not specified (only a Python background is mentioned as required)",[100,101,87],"PyTorch","OpenAI Gym",[14],[104,105,106,107,108,109,110,111,112,113,114,115],"reinforcement-learning","machine-learning","artificial-intelligence","deep-reinforcement-learning","deep-learning","policy-gradients","evolution-strategies","a2c","deepmind","dqn","qlearning","ppo","2026-03-27T02:49:30.150509","2026-04-18T09:19:19.292705",[],[]]