[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Denys88--rl_games":3,"tool-Denys88--rl_games":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 
多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":78,"owner_location":79,"owner_email":78,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":83,"stars":92,"forks":93,"last_commit_at":94,"license":95,"difficulty_score":10,"env_os":96,"env_gpu":97,"env_ram":96,"env_deps":98,"category_tags":111,"github_topics":112,"view_count":10,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":116,"updated_at":117,"faqs":118,"releases":149},747,"Denys88\u002Frl_games","rl_games","RL implementations","rl_games 是一款基于 PyTorch 的高性能强化学习开源库，旨在简化并加速智能体的训练过程。它主要解决了传统强化学习框架中 GPU 利用率低、环境适配繁琐以及大规模并行训练困难等问题。\n\n通过提供端到端的 GPU 加速训练管道，rl_games 能够无缝对接 Isaac Gym、Brax 及 MuJoCo 等多种仿真环境。其核心亮点包括支持不对称 Actor-Critic 结构的 PPO 算法、多智能体训练（含去中心化与集中式 Critic）、自博弈模式以及掩码动作支持。此外，它还集成了 EnvPool 引擎以实现极高的环境执行效率，并允许将模型导出为 ONNX 格式以便部署。\n\n对于从事机器人控制、游戏 AI 开发或强化学习算法研究的技术人员来说，它是非常理想的选择。无论是希望在 NVIDIA Isaac Gym 上训练机械臂，还是在星际争霸等多智能体环境中测试策略，rl_games 都能提供稳定且高效的实现方案，帮助团队快速从实验走向落地。","# RL Games: High performance RL library\n\n**Note:** The next release will be 2.0.0 (unreleased). It migrates fully from `gym` to `gymnasium` and removes legacy environment integrations (envpool, cule).\n\n## Discord Channel Link \n* https:\u002F\u002Fdiscord.gg\u002FhnYRq7DsQh\n\n## Papers and related links\n\n* Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning: https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.10470\n* DeXtreme: Transfer of Agile In-Hand Manipulation from Simulation to Reality: https:\u002F\u002Fdextreme.org\u002F https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.13702\n* Transferring Dexterous Manipulation from GPU Simulation to a Remote Real-World TriFinger: https:\u002F\u002Fs2r2-ig.github.io\u002F https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.09779\n* Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge? 
\u003Chttps:\u002F\u002Farxiv.org\u002Fabs\u002F2011.09533>\n* Superfast Adversarial Motion Priors (AMP) implementation: https:\u002F\u002Ftwitter.com\u002Fxbpeng4\u002Fstatus\u002F1506317490766303235 https:\u002F\u002Fgithub.com\u002FNVIDIA-Omniverse\u002FIsaacGymEnvs\n* OSCAR: Data-Driven Operational Space Control for Adaptive and Robust Robot Manipulation: https:\u002F\u002Fcremebrule.github.io\u002Foscar-web\u002F https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.00704\n* EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine: https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.10558 and https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fenvpool\n* TimeChamber: A Massively Parallel Large Scale Self-Play Framework: https:\u002F\u002Fgithub.com\u002Finspirai\u002FTimeChamber\n\n\n## Some results on the different environments  \n\n* [NVIDIA Isaac Gym](docs\u002FISAAC_GYM.md)\n\n![Ant_running](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_readme_b03974edb7cc.gif)\n![Humanoid_running](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_readme_890d20c413f5.gif)\n\n![Allegro_Hand_400](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_readme_11134deb6784.gif)\n![Shadow_Hand_OpenAI](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_readme_2baaeebf050b.gif)\n\n* [Dextreme](https:\u002F\u002Fdextreme.org\u002F)\n\n![Allegro_Hand_real_world](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_readme_781b8db8a5f1.gif)\n\n* [DexPBT](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdexpbt)\n\n![AllegroKuka](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_readme_b7ba2e340337.png)\n\n* [Starcraft 2 Multi Agents](docs\u002FSMAC.md)  \n* [BRAX](docs\u002FBRAX.md)  \n* [Mujoco Envpool](docs\u002FMUJOCO_ENVPOOL.md) \n* [DeepMind Envpool](docs\u002FDEEPMIND_ENVPOOL.md) \n* [Atari Envpool](docs\u002FATARI_ENVPOOL.md) \n* [Random Envs](docs\u002FOTHER.md)  \n\n\nImplemented in PyTorch:\n\n* PPO with support for the asymmetric actor-critic variant\n* Support for an end-to-end GPU-accelerated training pipeline with Isaac Gym and Brax\n* Masked actions support\n* Multi-agent training, decentralized and centralized critic variants\n* Self-play\n\nImplemented in Tensorflow 1.x (removed in this version):\n\n* Rainbow DQN\n* A2C\n* PPO\n\n## Quickstart: Colab in the Cloud\n\nExplore RL Games quickly and easily in Colab notebooks:\n\n* [Mujoco training](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FDenys88\u002Frl_games\u002Fblob\u002Fmaster\u002Fnotebooks\u002Fmujoco_envpool_training.ipynb) Mujoco envpool training example.\n* [Brax training](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FDenys88\u002Frl_games\u002Fblob\u002Fmaster\u002Fnotebooks\u002Fbrax_training.ipynb) Brax training example, keeping all observations and actions on the GPU.\n* [Onnx discrete space export example with Cartpole](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FDenys88\u002Frl_games\u002Fblob\u002Fmaster\u002Fnotebooks\u002Ftrain_and_export_onnx_example_discrete.ipynb) envpool training example.\n* [Onnx continuous space export example with Pendulum](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FDenys88\u002Frl_games\u002Fblob\u002Fmaster\u002Fnotebooks\u002Ftrain_and_export_onnx_example_continuous.ipynb) envpool training example.\n* [Onnx continuous space with LSTM export example with Pendulum](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FDenys88\u002Frl_games\u002Fblob\u002Fmaster\u002Fnotebooks\u002Ftrain_and_export_onnx_example_lstm_continuous.ipynb) envpool training example.\n
\n## Installation\n\nFor maximum training performance, a preliminary installation of PyTorch 2.2 or newer with CUDA 12.1 or newer is highly recommended:\n\n```bash\npip3 install torch torchvision\n```\n\nThen:\n\n```bash\npip install rl-games\n``` \n\nOr clone the repo and install the latest version from source:\n```bash\npip install -e .\n```\n\nRunning CPU-based environments requires either envpool (where supported) or Ray: ```pip install envpool``` or ```pip install ray```.\nTo train on Mujoco, Atari or Box2d based environments, they need to be additionally installed with ```pip install gym[mujoco]```, ```pip install gym[atari]``` or ```pip install gym[box2d]``` respectively.\n\nRunning Atari also requires ```pip install opencv-python```. For modern Gymnasium\u002FALE Atari environments, install ```pip install ale-py```. In addition, installing envpool is highly recommended for maximum simulation and training performance of Mujoco and Atari environments: ```pip install envpool```\n\n### EnvPool + NumPy 2+ Incompatibility\n\n**IMPORTANT:** If using EnvPool, you **must** use NumPy 1.x. NumPy 2.0+ is **NOT compatible** with EnvPool and will cause training failures ([see issue](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fenvpool\u002Fissues\u002F312)).\n\nDowngrade to NumPy 1.26.4:\n```bash\npip uninstall numpy\npip install numpy==1.26.4\n```\n
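\nA quick way to verify these constraints before a long run is a small sanity-check script (a sketch; the version bounds are only the ones stated above):\n\n```python\n# Check the PyTorch \u002F CUDA \u002F NumPy constraints from the Installation section.\nimport numpy as np\nimport torch\n\nprint('torch', torch.__version__, '| cuda available:', torch.cuda.is_available())\n# EnvPool requires NumPy 1.x (see the linked envpool issue above).\nif int(np.__version__.split('.')[0]) >= 2:\n    raise RuntimeError('NumPy 2.x detected: downgrade to numpy==1.26.4 before using EnvPool')\n```\n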
\n## Citing\n\nIf you use rl-games in your research, please use the following citation:\n\n```bibtex\n@misc{rl-games2021,\ntitle = {rl-games: A High-performance Framework for Reinforcement Learning},\nauthor = {Makoviichuk, Denys and Makoviychuk, Viktor},\nmonth = {May},\nyear = {2021},\npublisher = {GitHub},\njournal = {GitHub repository},\nhowpublished = {\\url{https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games}},\n}\n```\n\n\n## Development setup\n\n```bash\npoetry install\n# install cuda related dependencies\npoetry run pip install torch torchvision\n```\n\n## Training\n**NVIDIA Isaac Gym**\n\nDownload and follow the installation instructions of Isaac Gym: https:\u002F\u002Fdeveloper.nvidia.com\u002Fisaac-gym  \nAnd IsaacGymEnvs: https:\u002F\u002Fgithub.com\u002FNVIDIA-Omniverse\u002FIsaacGymEnvs\n\n*Ant*\n\n```bash\npython train.py task=Ant headless=True\npython train.py task=Ant test=True checkpoint=nn\u002FAnt.pth num_envs=100\n```\n\n*Humanoid*\n\n```bash\npython train.py task=Humanoid headless=True\npython train.py task=Humanoid test=True checkpoint=nn\u002FHumanoid.pth num_envs=100\n```\n\n*Shadow Hand block orientation task*\n\n```bash\npython train.py task=ShadowHand headless=True\npython train.py task=ShadowHand test=True checkpoint=nn\u002FShadowHand.pth num_envs=100\n```\n\n**Other**\n\n*Atari Pong*\n\n```bash\npython runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_pong_envpool.yaml\npython runner.py --play --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_pong_envpool.yaml --checkpoint nn\u002FPong-v5_envpool.pth\n```\n\nOr with poetry:\n\n```bash\npoetry install -E atari\npoetry run python runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_pong.yaml\npoetry run python runner.py --play --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_pong.yaml --checkpoint nn\u002FPongNoFrameskip.pth\n```\n\n*Brax Ant*\n\n```bash\npip install -U \"jax[cuda12]\"\npip install brax\npython runner.py --train --file rl_games\u002Fconfigs\u002Fbrax\u002Fppo_ant.yaml\npython runner.py --play --file rl_games\u002Fconfigs\u002Fbrax\u002Fppo_ant.yaml --checkpoint runs\u002FAnt_brax\u002Fnn\u002FAnt_brax.pth\n```\n
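\nThe same train\u002Fplay entry points can also be driven from Python instead of the CLI. A minimal sketch (the Runner class lives in rl_games\u002Ftorch_runner.py; the dict keys mirror the --train\u002F--play\u002F--checkpoint\u002F--sigma flags):\n\n```python\nimport yaml\nfrom rl_games.torch_runner import Runner\n\n# Any shipped config works here; ppo_cartpole.yaml is also used in the Multi GPU section below.\nwith open('rl_games\u002Fconfigs\u002Fppo_cartpole.yaml') as f:\n    config = yaml.safe_load(f)\n\nrunner = Runner()\nrunner.load(config)  # builds the algo, model and network from the 'params' section\nrunner.reset()\nrunner.run({'train': True, 'play': False, 'checkpoint': None, 'sigma': None})\n```\n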
\n## Experiment tracking\n\nrl_games supports experiment tracking with [Weights and Biases](https:\u002F\u002Fwandb.ai).\n\n```bash\npython runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_breakout_torch.yaml --track\nWANDB_API_KEY=xxxx python runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_breakout_torch.yaml --track\npython runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_breakout_torch.yaml --wandb-project-name rl-games-special-test --track\npython runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_breakout_torch.yaml --wandb-project-name rl-games-special-test --wandb-entity openrlbenchmark --track\n```\n\n\n## Multi GPU\n\nWe use `torchrun` to orchestrate any multi-GPU runs.\n\n```bash\ntorchrun --standalone --nnodes=1 --nproc_per_node=2 runner.py --train --file rl_games\u002Fconfigs\u002Fppo_cartpole.yaml\n```\n\n## Config Parameters\n\n| Field | Example Value | Default | Description |\n| --- | --- | --- | --- |\n| seed | 8 | None | Seed for PyTorch, NumPy etc. |\n| algo | | | Algorithm block. |\n| name | a2c_continuous | None | Algorithm name. Possible values are: sac, a2c_discrete, a2c_continuous |\n| model | | | Model block. |\n| name | continuous_a2c_logstd | None | Possible values: continuous_a2c (expects sigma in (0, +inf)), continuous_a2c_logstd (expects sigma in (-inf, +inf)), a2c_discrete, a2c_multi_discrete |\n| network | | | Network description. |\n| name | actor_critic | | Possible values: actor_critic or soft_actor_critic. |\n| separate | False | | Whether to use a separate network with the same architecture for the critic. In almost all cases, if you normalize the value it is better to keep this False |\n| space | | | Network space |\n| continuous | | | continuous or discrete |\n| mu_activation | None | | Activation for mu. In almost all cases None works best, but tanh may be worth trying. |\n| sigma_activation | None | | Activation for sigma. Will be treated as log(sigma) or sigma depending on the model. |\n| mu_init | | | Initializer for mu. |\n| name | default | | |\n| sigma_init | | | Initializer for sigma. If you are using the logstd model, a good value is 0. |\n| name | const_initializer | | |\n| val | 0 | | |\n| fixed_sigma | True | | If True, the sigma vector doesn't depend on the input. |\n| cnn | | | Convolution block. |\n| type | conv2d | | Type: two types are currently supported: conv2d or conv1d |\n| activation | elu | | Activation between conv layers. |\n| initializer | | | Initializer. Some names are borrowed from TensorFlow. |\n| name | glorot_normal_initializer | | Initializer name |\n| gain | 1.4142 | | Additional parameter.
                                                                                                                              |\n| convs                  |                           |         | Convolution layers. Same parameters as we have in torch.                                                                                                     |\n| filters                | 32                        |         | Number of filters.                                                                                                                                           |\n| kernel_size            | 8                         |         | Kernel size.                                                                                                                                                 |\n| strides                | 4                         |         | Strides                                                                                                                                                      |\n| padding                | 0                         |         | Padding                                                                                                                                                      |\n| filters                | 64                        |         | Next convolution layer info.                                                                                                                                 |\n| kernel_size            | 4                         |         |                                                                                                                                                              |\n| strides                | 2                         |         |                                                                                                                                                              |\n| padding                | 0                         |         |                                                                                                                                                              |\n| filters                | 64                        |         |                                                                                                                                                              |\n| kernel_size            | 3                         |         |                                                                                                                                                              |\n| strides                | 1                         |         |                                                                                                                                                              |\n| padding                | 0                         |         |\n| mlp                    |                           |         | MLP Block. Convolution is supported too. See other config examples.                                                                                          |\n| units                  |                           |         | Array of sizes of the MLP layers, for example: [512, 256, 128]                                                                                               |\n| d2rl                   | False                     |         | Use d2rl architecture from https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.09163.                                                                 
|\n| activation | elu | | Activations between dense layers. |\n| initializer | | | Initializer. |\n| name | default | | Initializer name. |\n| rnn | | | RNN block. |\n| name | lstm | | RNN layer name. lstm and gru are supported. |\n| units | 256 | | Number of units. |\n| layers | 1 | | Number of layers |\n| before_mlp | False | False | Whether to apply the rnn before the mlp block or not. |\n| config | | | RL Config block. |\n| reward_shaper | | | Reward shaper. Can apply simple transformations. |\n| min_val | -1 | | You can apply min_val, max_val, scale and shift. |\n| scale_value | 0.1 | 1 | |\n| normalize_advantage | True | True | Normalize Advantage. |\n| gamma | 0.995 | | Reward Discount |\n| tau | 0.95 | | Lambda for GAE. Called tau by mistake a long time ago, because lambda is a keyword in Python :( |\n| learning_rate | 3e-4 | | Learning rate. |
\n| name | walker | | Name which will be used in TensorBoard. |\n| save_best_after | 10 | | How many epochs to wait before starting to save the checkpoint with the best score. |\n| score_to_win | 300 | | If the score is >= this value, training will stop. |\n| grad_norm | 1.5 | | Grad norm. Applied if truncate_grads is True. A good value is in (1.0, 10.0) |\n| entropy_coef | 0 | | Entropy coefficient. A good value for continuous spaces is 0; for discrete, 0.02 |\n| truncate_grads | True | | Whether to truncate gradients or not. It stabilizes training. |\n| env_name | BipedalWalker-v3 | | Environment name. |\n| e_clip | 0.2 | | Clip parameter for the PPO loss. |\n| clip_value | False | | Apply clipping to the value loss. If you are using normalize_value you don't need it. |\n| num_actors | 16 | | Number of running actors\u002Fenvironments. |\n| horizon_length | 4096 | | Horizon length per actor. The total number of steps will be num_actors * horizon_length * num_agents (if the env is not multi-agent, num_agents == 1). |\n| minibatch_size | 8192 | | Minibatch size. The total number of steps must be divisible by the minibatch size. |\n| minibatch_size_per_env | 8 | | Minibatch size per env. If specified, overrides the default minibatch size with the value minibatch_size_per_env * num_envs. |\n| mini_epochs | 4 | | Number of mini-epochs. A good value is in [1,10] |\n| critic_coef | 2 | | Critic coefficient. By default critic_loss = critic_coef * 1\u002F2 * MSE. |\n| lr_schedule | adaptive | None | Scheduler type. Can be None, linear or adaptive. Adaptive is the best for continuous control tasks. The learning rate is changed every mini-epoch |
\n| kl_threshold | 0.008 | | KL threshold for the adaptive schedule. If KL \u003C kl_threshold\u002F2 then lr = lr * 1.5, and the opposite otherwise. |\n| normalize_input | True | | Apply a running mean std to the input. |\n| bounds_loss_coef | 0.0 | | Coefficient for the auxiliary loss in the continuous case. |\n| max_epochs | 10000 | | Maximum number of epochs to run. |\n| max_frames | 5000000 | | Maximum number of frames (env steps) to run. |\n| normalize_value | True | | Use value running mean std normalization. |\n| use_diagnostics | True | | Adds more information to TensorBoard. |\n| value_bootstrap | True | | Bootstrap the value when an episode is finished. Very useful for various locomotion envs. |\n| bound_loss_type | regularisation | None | Adds an aux loss for the continuous case. 'regularisation' is the sum of squared actions. 'bound' is the sum of actions above 1.1. |\n| bounds_loss_coef | 0.0005 | 0 | Regularisation coefficient |\n| use_smooth_clamp | False | | Use a smooth clamp instead of the regular one for clipping |\n| zero_rnn_on_done | False | True | If False, the RNN internal state is not reset (set to 0) when an environment is reset. Can improve training in some cases, for example when domain randomization is on |\n| player | | | Player configuration block. |\n| render | True | False | Render the environment |\n| deterministic | True | True | Use a deterministic policy (argmax or mu) or a stochastic one. |
\n| use_vecenv | True | False | Use vecenv to create the environment for the player |\n| games_num | 200 | | Number of games to run in player mode. |\n| env_config | | | Env configuration block. It goes directly to the environment. This example was taken from my Atari wrapper. |\n| skip | 4 | | Number of frames to skip |\n| name | BreakoutNoFrameskip-v4 | | The exact name of an (Atari) gym env. This is just an example; depending on the training env, these parameters can be different. |\n| evaluation | True | False | Enables the evaluation feature for inferencing while training. |\n| update_checkpoint_freq | 100 | 100 | Frequency in number of steps to look for new checkpoints. |\n| dir_to_monitor | | | Directory to search for checkpoints in during evaluation. |\n
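\nPutting the blocks above together, a minimal continuous-control config might look like the sketch below (values are illustrative, taken from the example column; the authoritative nesting is in the shipped configs under rl_games\u002Fconfigs):\n\n```yaml\nparams:\n  seed: 8\n  algo:\n    name: a2c_continuous\n  model:\n    name: continuous_a2c_logstd\n  network:\n    name: actor_critic\n    separate: False\n    space:\n      continuous:\n        mu_activation: None\n        sigma_activation: None\n        mu_init:\n          name: default\n        sigma_init:\n          name: const_initializer\n          val: 0\n        fixed_sigma: True\n    mlp:\n      units: [256, 128]\n      activation: elu\n      initializer:\n        name: default\n  config:\n    name: walker\n    env_name: BipedalWalker-v3\n    reward_shaper:\n      scale_value: 0.1\n    normalize_advantage: True\n    gamma: 0.995\n    tau: 0.95\n    learning_rate: 3e-4\n    lr_schedule: adaptive\n    kl_threshold: 0.008\n    score_to_win: 300\n    grad_norm: 1.5\n    entropy_coef: 0\n    truncate_grads: True\n    e_clip: 0.2\n    clip_value: False\n    num_actors: 16\n    horizon_length: 4096\n    minibatch_size: 8192\n    mini_epochs: 4\n    critic_coef: 2\n    normalize_input: True\n    normalize_value: True\n    max_epochs: 10000\n```\n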
\n## Custom network example: \n[simple test network](rl_games\u002Fenvs\u002Ftest_network.py)  \nThis network takes a dictionary observation.\nTo register it, you can add this code to your __init__.py:\n\n```python\nfrom rl_games.envs.test_network import TestNetBuilder\nfrom rl_games.algos_torch import model_builder\nmodel_builder.register_network('testnet', TestNetBuilder)\n```\n[simple test environment](rl_games\u002Fenvs\u002Ftest\u002Frnn_env.py)\n[example environment](rl_games\u002Fenvs\u002Ftest\u002Fexample_env.py)  \n\nAdditional environment supported properties and functions  \n\n| Field | Default Value | Description |\n| --- | --- | --- |\n| use_central_value | False | If True, the returned obs is expected to be a dict with 'obs' and 'state' |\n| value_size | 1 | Shape of the returned rewards. The network will support multi-head value automatically. |\n| concat_infos | False | Whether the default vecenv should convert a list of dicts to a dict of lists. Very useful if you want to use value bootstrapping; in this case you always need to return 'time_outs': True or False from the env. |\n| get_number_of_agents(self) | 1 | Returns the number of agents in the environment |\n| has_action_mask(self) | False | Returns True if the environment has an invalid-actions mask. |\n| get_action_mask(self) | None | Returns action masks if has_action_mask is True. A good example is [SMAC Env](rl_games\u002Fenvs\u002Ftest\u002Fsmac_env.py) |\n
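\nAs a sketch of how these hooks fit together (a hypothetical single-agent env with masked actions; only the properties in the table above are rl_games-specific, the rest is the standard gym interface):\n\n```python\nimport gym\nimport numpy as np\n\nclass MaskedDiscreteEnv(gym.Env):\n    # Hypothetical environment illustrating the optional rl_games properties above.\n    def __init__(self, **kwargs):\n        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(8,))\n        self.action_space = gym.spaces.Discrete(4)\n\n    def get_number_of_agents(self):\n        return 1  # single-agent env\n\n    def has_action_mask(self):\n        return True  # tells rl_games to query get_action_mask()\n\n    def get_action_mask(self):\n        # 1 = action allowed, 0 = invalid; here the last action is always invalid\n        return np.array([1, 1, 1, 0], dtype=np.uint8)\n\n    def reset(self):\n        return np.zeros(8, dtype=np.float32)\n\n    def step(self, action):\n        obs = np.zeros(8, dtype=np.float32)\n        # returning 'time_outs' in info pairs with concat_infos \u002F value_bootstrap above\n        return obs, 0.0, False, {'time_outs': False}\n```\n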
\n## Release Notes\n\n1.6.5\n\n* Added torch.compile support with configurable modes. Provides 10-40% performance improvement. Requires torch 2.2 or newer.\n  * Default mode is `reduce-overhead` for balanced compilation time and runtime performance\n  * Configurable via `torch_compile` parameter in yaml configs (true\u002Ffalse\u002F\"default\"\u002F\"reduce-overhead\"\u002F\"max-autotune\")\n  * Separate compilation modes for actor and central value networks\n  * See [torch.compile documentation](docs\u002FTORCH_COMPILE.md) for detailed configuration and mode selection guidance\n* Fixed critical bugs in asymmetric actor-critic (central_value) training:\n  * Fixed incorrect device reference in `update_lr()` method\n  * Fixed infinite loop when iterating over dataset\n  * Added proper `__iter__` method to `PPODataset` class\n* Fixed variance calculation in `RunningMeanStd` to use population variance\n* Fixed get_mean_std_with_masks function.\n* Fixed missing central value optimizer state in checkpoint save\u002Fload\n* Added myosuite support.\n* Added auxiliary loss support.\n* Update for tacsl release: CNN tower processing, critic weights loading and freezing.\n* Fixed SAC input normalization.\n* Fixed SAC agent summary writer to use the configured directory instead of hardcoded 'runs\u002F'\n* Fixed default player config num_games value.\n* Fixed applying minibatch size per env.\n* Added concat_output support for RNN.\n* SAC improvements:\n  * Fixed missing `gamma_tensor` initialization bug\n  * Removed hardcoded torch.compile decorators (now respects YAML config)\n  * Optimized tensor operations and removed unnecessary clones\n* Environment wrapper fixes:\n  * Fixed tuple\u002Flist observation handling for compatibility with various gym environments\n  * Added proper numpy to torch tensor conversion in `cast_obs`\n  * Fixed missing gym import in envpool wrapper\n* Ray integration improvements:\n  * Moved Ray import to lazy loading (only when RayVecEnv is used)\n  * Added configurable Ray initialization with `ray_config` parameter\n  * Added proper cleanup with `close()` method for Ray actors\n  * Default 1GB object store memory allocation\n\n\n1.6.1\n\n* Fixed a Central Value RNN bug which occurs when training a multi-agent environment.\n* Added Deepmind Control PPO benchmark.\n* Added a few more experimental ways to train value prediction (OneHot, TwoHot encoding and crossentropy loss instead of L2). The new methods didn't improve results so far and cannot be turned on from the yaml files; once we find an env which trains better with them, they will be added to the config.\n* Added shaped reward graph to the tensorboard.\n* Fixed bug with SAC not saving weights with save_frequency.\n* Added multi-node training support for GPU-accelerated training environments like Isaac Gym. No changes in training scripts are required. Thanks to @ankurhanda and @ArthurAllshire for assistance in implementation.\n* Added evaluation feature for inferencing during training. Checkpoints from the training process can be automatically picked up and updated in the inferencing process when enabled.\n* Added get\u002Fset API for runtime update of rl training parameters. Thanks to @ArthurAllshire for the initial version of fast PBT code.\n* Fixed SAC not loading weights properly.\n* Removed the Ray dependency for use cases where it's not required.\n* Added warning for using deprecated 'seq_len' instead of 'seq_length' in configs with RNN networks.\n\n\n1.6.0\n\n* Added ONNX export colab examples for discrete and continuous action spaces. For the continuous case an LSTM policy example is provided as well.\n* Improved RNN training in continuous space, added option `zero_rnn_on_done`.\n* Added NVIDIA CuLE support: https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fcule\n* Added player config override. Vecenv is used for inference.\n* Fixed multi-gpu training with central value.\n* Fixed max_frames termination condition, and its interaction with the linear learning rate: https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games\u002Fissues\u002F212\n* Fixed \"deterministic\" misspelling issue.\n* Fixed Mujoco and Brax SAC configs.\n* Fixed multiagent envs statistics reporting. Fixed Starcraft2 SMAC environments.\n\n1.5.2\n\n* Added observation normalization to the SAC.\n* Returned back the adaptive KL legacy mode.\n\n1.5.1\n\n* Fixed build package issue.\n\n1.5.0\n\n* Added wandb support.\n* Added poetry support.\n* Fixed various bugs.\n* Fixed CNN input not being divided by 255 in the case of dictionary obs.\n* Added more envpool mujoco and atari training examples. Some of the results: 15 min Mujoco humanoid training, 2 min Atari Pong.\n* Added Brax and Mujoco colab training examples.\n* Added 'seed' command line parameter. Will override the seed in the config in case it's > 0.\n* Deprecated `horovod` in favor of `torch.distributed` ([#171](https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games\u002Fpull\u002F171)).\n\n1.4.0\n\n* Added discord channel https:\u002F\u002Fdiscord.gg\u002FhnYRq7DsQh :)\n* Added envpool support with a few atari examples. Works 3-4x faster than Ray.\n* Added mujoco results. Much better than the OpenAI Spinning Up PPO results.\n* Added tcnn (https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn) support. Reduces training time by 5-10% in the IsaacGym envs. \n* Various fixes and improvements.\n\n1.3.2\n\n* Added 'sigma' command line parameter. Will override sigma for continuous spaces if fixed_sigma is True.\n\n1.3.1\n\n* Fixed SAC not working\n\n1.3.0\n\n* Simplified rnn implementation. Works a little bit slower but is much more stable. \n* Now the central value can be non-rnn if the policy is rnn.\n* Removed load_checkpoint from the yaml file; 
now --checkpoint works for both train and play.\n\n1.2.0\n\n* Added Swish (SiLU) and GELU activations; they can improve Isaac Gym results for some of the envs.\n* Removed tensorflow and made an initial cleanup of the old\u002Funused code.\n* Simplified runner.\n* Now networks are created in the algos with the load_network method.\n\n1.1.4\n\n* Fixed crash in play (test) mode in the player when the simulation and rl devices are not the same.\n* Fixed various multi-GPU errors.\n\n1.1.3\n\n* Fixed crash when running a single Isaac Gym environment in play (test) mode.\n* Added config parameter ```clip_actions``` for switching off internal action clipping and rescaling\n\n1.1.0\n\n* Added to pypi: ```pip install rl-games```\n* Added reporting of env (sim) step fps, without policy inference. Improved naming.\n* Renames in the yaml config for better readability: steps_num to horizon_length and lr_threshold to kl_threshold\n\n\n\n## Troubleshooting\n\n* Some of the supported envs are not installed with setup.py; you need to install them manually\n* Starting from rl-games 1.1.0, old yaml configs won't be compatible with the new version: \n    * ```steps_num``` should be changed to ```horizon_length``` and ```lr_threshold``` to ```kl_threshold```\n\n## Known issues\n\n* Running a single environment with Isaac Gym can cause a crash; if it happens, switch to at least 2 environments simulated in parallel\n    \n\n","# RL Games：高性能强化学习 (RL) 库\n\n**注意：** 下一个版本将是 2.0.0（未发布）。它将从 `gym` 完全迁移到 `gymnasium`，并移除旧的环境集成（envpool, cule）。\n\n## Discord 频道链接 \n* https:\u002F\u002Fdiscord.gg\u002FhnYRq7DsQh\n\n## 论文及相关链接\n\n* Isaac Gym：用于机器人学习的高性能基于 GPU 的物理仿真：https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.10470\n* DeXtreme：从仿真到现实的敏捷手中操作转移：https:\u002F\u002Fdextreme.org\u002F https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.13702\n* 将灵巧操作从 GPU 仿真转移到远程真实世界 TriFinger：https:\u002F\u002Fs2r2-ig.github.io\u002F https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.09779\n* 在星际争霸多智能体挑战中，独立学习是唯一的必要条件吗？\u003Chttps:\u002F\u002Farxiv.org\u002Fabs\u002F2011.09533>\n* Superfast 对抗运动先验 (AMP) 实现：https:\u002F\u002Ftwitter.com\u002Fxbpeng4\u002Fstatus\u002F1506317490766303235 https:\u002F\u002Fgithub.com\u002FNVIDIA-Omniverse\u002FIsaacGymEnvs\n* OSCAR：面向自适应和鲁棒机器人操作的数据驱动操作空间控制：https:\u002F\u002Fcremebrule.github.io\u002Foscar-web\u002F https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.00704\n* EnvPool：高度并行的强化学习环境执行引擎：https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.10558 和 https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fenvpool\n* TimeChamber：大规模并行自博弈框架：https:\u002F\u002Fgithub.com\u002Finspirai\u002FTimeChamber\n\n\n## 不同环境中的一些结果  \n\n* [NVIDIA Isaac Gym](docs\u002FISAAC_GYM.md)\n\n![Ant_running](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_readme_b03974edb7cc.gif)\n![Humanoid_running](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_readme_890d20c413f5.gif)\n\n![Allegro_Hand_400](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_readme_11134deb6784.gif)\n![Shadow_Hand_OpenAI](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_readme_2baaeebf050b.gif)\n\n* [Dextreme](https:\u002F\u002Fdextreme.org\u002F)\n\n![Allegro_Hand_real_world](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_readme_781b8db8a5f1.gif)\n\n* [DexPBT](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdexpbt)\n\n![AllegroKuka](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_readme_b7ba2e340337.png)\n\n* [星际争霸 2 多智能体](docs\u002FSMAC.md)  \n* [BRAX](docs\u002FBRAX.md)  \n* 
[Mujoco Envpool](docs\u002FMUJOCO_ENVPOOL.md) \n* [DeepMind Envpool](docs\u002FDEEPMIND_ENVPOOL.md) \n* [Atari Envpool](docs\u002FATARI_ENVPOOL.md) \n* [随机环境](docs\u002FOTHER.md)  \n\n\n使用 PyTorch 实现：\n\n* 支持非对称 Actor-Critic (演员 - 评论家) 变体的 PPO (近端策略优化)\n* 支持使用 Isaac Gym 和 Brax 的端到端 GPU (图形处理器) 加速训练流程\n* 支持掩码动作\n* 多智能体训练，去中心化和集中式 Critic (评论家) 变体\n* 自博弈 \n\n使用 TensorFlow 1.x 实现（此版本已移除）：\n\n* Rainbow DQN (深度 Q 网络)\n* A2C (异步优势演员 - 评论家)\n* PPO (近端策略优化)\n\n## 快速开始：云端 Colab\n\n在 Colab 笔记本中快速轻松地探索 RL Games：\n\n* [Mujoco 训练](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FDenys88\u002Frl_games\u002Fblob\u002Fmaster\u002Fnotebooks\u002Fmujoco_envpool_training.ipynb) Mujoco envpool 训练示例。\n* [Brax 训练](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FDenys88\u002Frl_games\u002Fblob\u002Fmaster\u002Fnotebooks\u002Fbrax_training.ipynb) Brax 训练示例，保持所有观测值和动作在 GPU (图形处理器) 上。\n* [Cartpole 的 ONNX (开放神经网络交换) 离散空间导出示例](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FDenys88\u002Frl_games\u002Fblob\u002Fmaster\u002Fnotebooks\u002Ftrain_and_export_onnx_example_discrete.ipynb) envpool 训练示例。\n* [Pendulum 的 ONNX (开放神经网络交换) 连续空间导出示例](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FDenys88\u002Frl_games\u002Fblob\u002Fmaster\u002Fnotebooks\u002Ftrain_and_export_onnx_example_continuous.ipynb) envpool 训练示例。\n* [带有 LSTM (长短期记忆网络) 的 Pendulum ONNX (开放神经网络交换) 连续空间导出示例](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FDenys88\u002Frl_games\u002Fblob\u002Fmaster\u002Fnotebooks\u002Ftrain_and_export_onnx_example_lstm_continuous.ipynb) envpool 训练示例。\n\n## 安装\n\n为了获得最大的训练性能，强烈建议预先安装 PyTorch 2.2 或更高版本以及 CUDA 12.1 或更高版本：\n\n```bash\npip3 install torch torchvision\n```\n\n然后：\n\n```bash\npip install rl-games\n``` \n\n或者克隆仓库并从源代码安装最新版本：\n```bash\npip install -e .\n```\n\n要运行基于 CPU (中央处理器) 的环境，需要安装 envpool（如果支持）或 Ray：```pip install envpool``` 或 ```pip install ray```\n要运行 Mujoco、Atari 游戏或基于 Box2d 的环境训练，需要分别额外安装 ```pip install gym[mujoco]```、```pip install gym[atari]``` 或 ```pip install gym[box2d]```。\n\n运行 Atari 还需要 ```pip install opencv-python```。对于现代的 Gymnasium\u002FALE Atari 环境，请安装 ```pip install ale-py```。此外，强烈建议安装 envpool 以获得 Mujoco 和 Atari 环境的最大模拟和训练性能：```pip install envpool```\n\n### EnvPool + NumPy 2+ 不兼容问题\n\n**重要：** 如果使用 EnvPool，您**必须**使用 NumPy 1.x。NumPy 2.0+ **不兼容** EnvPool 并将导致训练失败（[查看问题](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fenvpool\u002Fissues\u002F312)）。\n\n降级到 NumPy 1.26.4：\n```bash\npip uninstall numpy\npip install numpy==1.26.4\n```\n\n## 引用\n\n如果您在研究中使用 rl-games，请使用以下引用：\n\n```bibtex\n@misc{rl-games2021,\ntitle = {rl-games: A High-performance Framework for Reinforcement Learning},\nauthor = {Makoviichuk, Denys and Makoviychuk, Viktor},\nmonth = {May},\nyear = {2021},\npublisher = {GitHub},\njournal = {GitHub repository},\nhowpublished = {\\url{https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games}},\n}\n```\n\n\n## 开发环境设置\n\n```bash\npoetry install\n# 安装 cuda 相关依赖\npoetry run pip install torch torchvision\n```\n\n## 训练\n**NVIDIA Isaac Gym**\n\n下载并遵循 Isaac Gym (NVIDIA 强化学习仿真平台) 的安装说明：https:\u002F\u002Fdeveloper.nvidia.com\u002Fisaac-gym  \n以及 IsaacGymEnvs：https:\u002F\u002Fgithub.com\u002FNVIDIA-Omniverse\u002FIsaacGymEnvs\n\n*Ant*\n\n```bash\npython train.py task=Ant headless=True\npython train.py task=Ant test=True checkpoint=nn\u002FAnt.pth num_envs=100\n```\n\n*Humanoid*\n\n```bash\npython train.py task=Humanoid headless=True\npython train.py task=Humanoid test=True checkpoint=nn\u002FHumanoid.pth num_envs=100\n```\n\n*Shadow Hand 
方块朝向任务*\n\n```bash\npython train.py task=ShadowHand headless=True\npython train.py task=ShadowHand test=True checkpoint=nn\u002FShadowHand.pth num_envs=100\n```\n\n**其他**\n\n*Atari Pong*\n\n```bash\npython runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_pong_envpool.yaml\npython runner.py --play --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_pong_envpool.yaml --checkpoint nn\u002FPong-v5_envpool.pth\n```\n\n或者使用 Poetry (Python 包管理工具)：\n\n```bash\npoetry install -E atari\npoetry run python runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_pong.yaml\npoetry run python runner.py --play --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_pong.yaml --checkpoint nn\u002FPongNoFrameskip.pth\n```\n\n*Brax Ant*\n\n```bash\npip install -U \"jax[cuda12]\"\npip install brax\npython runner.py --train --file rl_games\u002Fconfigs\u002Fbrax\u002Fppo_ant.yaml\npython runner.py --play --file rl_games\u002Fconfigs\u002Fbrax\u002Fppo_ant.yaml --checkpoint runs\u002FAnt_brax\u002Fnn\u002FAnt_brax.pth\n```\n\n## 实验跟踪\n\nrl_games 支持通过 [Weights and Biases (W&B)](https:\u002F\u002Fwandb.ai) 进行实验跟踪。\n\n```bash\npython runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_breakout_torch.yaml --track\nWANDB_API_KEY=xxxx python runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_breakout_torch.yaml --track\npython runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_breakout_torch.yaml --wandb-project-name rl-games-special-test --track\npython runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_breakout_torch.yaml --wandb-project-name rl-games-special-test --wandb-entity openrlbenchmark --track\n```\n\n\n## 多 GPU\n\n我们使用 `torchrun` (PyTorch 分布式运行工具) 来编排所有多 GPU 运行。\n\n```bash\ntorchrun --standalone --nnodes=1 --nproc_per_node=2 runner.py --train --file rl_games\u002Fconfigs\u002Fppo_cartpole.yaml\n```\n\n## 配置参数\n\n| Field | Example Value | Default | Description |\n| --- | --- | --- | --- |\n| seed | 8 | None | 随机种子。用于 PyTorch、NumPy 等。 |\n| algo | | | 算法块。 |\n| name | a2c_continuous | None | 算法名称。可能值为：sac, a2c_discrete, a2c_continuous |\n| model | | | 模型块。 |\n| name | continuous_a2c_logstd | None | 可能值：continuous_a2c（期望 sigma 为 (0, +inf)）, continuous_a2c_logstd（期望 sigma 为 (-inf, +inf)）, a2c_discrete, a2c_multi_discrete |\n| network | | | 网络描述。
|\n| name | actor_critic | | 可能值：actor_critic 或 soft_actor_critic。 |\n| separate | False | | 是否使用具有相同架构的独立网络作为 Critic（评论家网络）。在几乎所有情况下，如果您对价值做归一化，最好将其设为 False。 |\n| space | | | 网络空间。 |\n| continuous | | | 连续或离散。 |\n| mu_activation | None | | mu 的激活函数。在几乎所有情况下 None 效果最好，也可以尝试 tanh。 |\n| sigma_activation | None | | sigma 的激活函数。根据模型的不同，输出将被视为 log(sigma) 或 sigma。 |\n| mu_init | | | mu 的初始化器。 |\n| name | default | | |\n| sigma_init | | | sigma 的初始化器。如果您使用 logstd 模型，0 是一个好的取值。 |\n| name | const_initializer | | |\n| val | 0 | | |\n| fixed_sigma | True | | 如果为 True，则 sigma 向量不依赖于输入。 |\n| cnn | | | 卷积块。 |\n| type | conv2d | | 类型：目前支持 conv2d 和 conv1d 两种。 |\n| activation | elu | | 卷积层之间的激活函数。 |\n| initializer | | | 初始化器。名称参考了 TensorFlow 中的叫法。 |\n| name | glorot_normal_initializer | | 初始化器名称。 |\n| gain | 1.4142 | | 附加参数。 |\n| convs | | | 卷积层。参数含义与 Torch 中相同。 |\n| filters | 32 | | 滤波器数量。 |\n| kernel_size | 8 | | 核大小。 |\n| strides | 4 | | 步长。 |\n| padding | 0 | | 填充。 |\n| filters | 64 | | 下一个卷积层的信息。 |\n| kernel_size | 4 | | |\n| strides | 2 | | |\n| padding | 0 | | |\n| filters | 64 | | |\n| kernel_size | 3 | | |\n| strides | 1 | | |\n| padding | 0 | | |\n| mlp | | | MLP（多层感知机）块。也支持卷积，请参见其他配置示例。 |\n| units | | | MLP 各层的尺寸数组，例如 [512, 256, 128]。 |\n| d2rl | False | | 使用来自 https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.09163 的 d2rl 架构。 |\n| activation | elu | | 全连接层之间的激活函数。 |\n| initializer | | | 初始化器。 |\n| name | default | | 初始化器名称。 |\n| rnn | | | RNN（循环神经网络）块。 |\n| name | lstm | | RNN 层类型。支持 lstm 和 gru。 |\n| units | 256 | | 单元数量。 |\n| layers | 1 | | 层数。 |\n| before_mlp | False | False | 是否在 mlp 块之前应用 rnn。 |\n| config | | | 强化学习配置块。 |\n| reward_shaper | | | 奖励塑形器，可以应用简单的变换。 |\n| min_val | -1 | | 可应用 min_val、max_val、scale 和 shift。 |\n| scale_value | 0.1 | 1 | |\n| normalize_advantage | True | True | 归一化优势。 |\n| gamma | 0.995 | | 奖励折扣因子。 |\n| tau | 0.95 | | GAE（广义优势估计）的 lambda。很久以前被错误地叫成 tau，因为 lambda 是 Python 的关键字 :( |\n| learning_rate | 3e-4 | | 学习率。 |\n| name | walker | | 将在 TensorBoard 中使用的名称。 |\n| save_best_after | 10 | | 开始保存最佳分数检查点之前要等待的 epoch 数。 |\n| score_to_win | 300 | | 如果分数 >= 该值，训练将停止。 |\n| grad_norm | 1.5 | | 梯度范数。在 truncate_grads 为 True 时生效，好的取值在 (1.0, 10.0) 之间。 |\n| entropy_coef | 0 | | 熵系数。连续空间的较好取值为 0，离散空间为 0.02。 |\n| truncate_grads | True | | 是否应用梯度截断，有助于稳定训练。 |\n| env_name | BipedalWalker-v3 | | 环境名称。 |\n| e_clip | 0.2 | | PPO 损失的 clip 参数。 |\n| clip_value | False | | 对价值损失应用 clip。如果您使用 normalize_value，则不需要它。 |\n| num_actors | 16 | | 并行运行的智能体\u002F环境数量。 |\n| horizon_length | 4096 | | 每个智能体的时间跨度长度。总步数为 num_actors * horizon_length * num_agents（如果环境不是多智能体，则 num_agents == 1）。 |\n| minibatch_size | 8192 | | 小批量大小。总步数必须能被小批量大小整除。 |\n| minibatch_size_per_env | 8 | | 每个环境的小批量大小。如果指定，将以 minibatch_size_per_env * num_envs 的值覆盖默认的小批量总大小。 |\n| mini_epochs | 4 | | mini epoch 的数量。好的取值在 [1, 10] 之间。 |\n| critic_coef | 2 | | Critic 系数。默认 critic_loss = critic_coef * 1\u002F2 * MSE。 |\n| lr_schedule | adaptive | None | 调度器类型，可以是 None、linear 或 adaptive。对连续控制任务 adaptive 效果最好，学习率在每个 mini epoch 后更新。 |\n| kl_threshold | 0.008 | | 自适应调度的 KL 阈值。如果 KL \u003C kl_threshold\u002F2 则 lr = lr * 1.5，反之亦然。 |\n| normalize_input | True | | 对输入应用运行均值标准差归一化。 |\n| bounds_loss_coef | 0.0 | | 连续空间的辅助损失系数。 |\n| max_epochs | 10000 | | 运行的最大 epoch 数。 |\n| max_frames | 5000000 | | 运行的最大帧数（环境步数）。 |\n| normalize_value | True | | 对价值应用运行均值标准差归一化。 |\n| use_diagnostics | True | | 向 TensorBoard 输出更多诊断信息。 |\n| value_bootstrap | True | | 当回合因超时结束时对价值进行引导（bootstrap），对各类运动（locomotion）环境非常有用。 |\n| bound_loss_type | regularisation | None | 为连续情况添加辅助损失：'regularisation' 是动作的平方和，'bound' 是对超过 1.1 的动作部分求和。 |\n| bounds_loss_coef | 0.0005 | 0 | 正则化系数。 |\n| use_smooth_clamp | False | | 使用平滑钳制（smooth clamp）代替常规裁剪。 |\n| zero_rnn_on_done | False | True | 如果为 False，环境重置时 RNN 内部状态不会重置为 0。在某些情况下（例如启用域随机化时）可能改善训练。 |\n| player | | | 玩家（player）配置块。 |\n| render | True | False | 是否渲染环境。 |\n| deterministic | True | True | 使用确定性策略（argmax 或 mu）还是随机策略。 |\n| use_vecenv | True | False | 是否使用 vecenv 为 player 创建环境。 |\n| games_num | 200 | | player 模式下运行的游戏局数。 |\n| env_config | | | 环境配置块，会直接传给环境。此示例取自我的 Atari 包装器。 |\n| skip | 4 | | 跳帧数量。 |\n| name | BreakoutNoFrameskip-v4 | | Gym 环境的精确名称（此例为 Atari）。具体取值取决于所训练的环境。 |\n| evaluation | True | False | 启用训练期间的推理评估功能。 |\n| update_checkpoint_freq | 100 | 100 | 检查新检查点的频率（以步为单位）。 |\n| dir_to_monitor | | | 评估期间搜索检查点的目录。 |
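\n\n下面用 Python 字典给出一份与上表字段对应的最小训练配置草图（仅为示意：结构仿照仓库中连续控制任务的 yaml 写法，algo\u002Fmodel 名称与字段取值来自上表示例列，未逐一核对官方示例，完整可用的配置请以仓库内 rl_games\u002Fconfigs 下的 yaml 文件为准）：\n\n```python\n# 最小配置草图：键名与上方参数表一一对应（示意用，非官方完整示例）\nconfig = {\n    'params': {\n        'seed': 5,\n        'algo': {'name': 'a2c_continuous'},          # 假设：连续动作空间的 PPO 实现\n        'model': {'name': 'continuous_a2c_logstd'},  # 假设：logstd 模型\n        'network': {\n            'name': 'actor_critic',\n            'separate': False,\n            'space': {'continuous': {\n                'mu_activation': 'None',\n                'sigma_activation': 'None',\n                'mu_init': {'name': 'default'},\n                'sigma_init': {'name': 'const_initializer', 'val': 0},\n                'fixed_sigma': True,\n            }},\n            'mlp': {'units': [256, 128, 64],\n                    'activation': 'elu',\n                    'initializer': {'name': 'default'}},\n        },\n        'config': {\n            'name': 'walker',\n            'env_name': 'BipedalWalker-v3',\n            'reward_shaper': {'scale_value': 0.1},\n            'normalize_advantage': True,\n            'gamma': 0.995,\n            'tau': 0.95,\n            'learning_rate': 3e-4,\n            'lr_schedule': 'adaptive',\n            'kl_threshold': 0.008,\n            'grad_norm': 1.5,\n            'entropy_coef': 0,\n            'truncate_grads': True,\n            'e_clip': 0.2,\n            'num_actors': 16,\n            'horizon_length': 4096,\n            'minibatch_size': 8192,  # 必须整除 num_actors * horizon_length\n            'mini_epochs': 4,\n            'critic_coef': 2,\n            'normalize_input': True,\n            'normalize_value': True,\n            'max_epochs': 10000,\n        },\n    },\n}\n```\n\n该字典与 yaml 文件内容等价，可直接传给 rl_games 的 Runner 使用（参见后文快速上手部分的编程式启动示例）。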
\n\n## 自定义网络示例：\n[简单测试网络](rl_games\u002Fenvs\u002Ftest_network.py)  \n该网络接收字典形式的观测（observation）。\n要注册它，可以在你的 __init__.py 中添加如下代码：\n\n```python\nfrom rl_games.envs.test_network import TestNetBuilder\nfrom rl_games.algos_torch import model_builder\n\nmodel_builder.register_network('testnet', TestNetBuilder)\n```\n\n注册后即可在 yaml 配置的 network 块中通过名称 'testnet' 引用该网络。\n\n[简单测试环境](rl_games\u002Fenvs\u002Ftest\u002Frnn_env.py)  \n[示例环境](rl_games\u002Fenvs\u002Ftest\u002Fexample_env.py)  \n\n额外支持的环境属性和函数：\n\n| Field | Default Value | Description |\n| --- | --- | --- |\n| use_central_value | False | 如果为 True，则环境返回的观测（obs）应为包含 'obs' 和 'state' 两个键的字典（dict）。 |\n| value_size | 1 | 返回奖励的形状。网络将自动支持多头价值（multihead value）。 |\n| concat_infos | False | 默认的 vecenv（向量环境）是否应将字典列表转换为列表字典。若要使用 value_bootstrap（价值引导），这非常有用；此时环境需要始终在 infos 中返回 'time_outs': True 或 False。 |\n| get_number_of_agents(self) | 1 | 返回环境中智能体（agent）的数量。 |\n| has_action_mask(self) | False | 如果环境具有无效动作掩码（action_mask），则返回 True。 |\n| get_action_mask(self) | None | 如果 has_action_mask 为 True，则返回动作掩码。一个好的例子是 [SMAC 环境](rl_games\u002Fenvs\u002Ftest\u002Fsmac_env.py)。 |
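\n\n下面给出一个实现上述附加接口的自定义环境草图（假设性示例：类名、动作数与掩码逻辑均为虚构，仅演示 rl_games 约定的接口形状，基于 gym 旧版 API）：\n\n```python\n# 自定义环境草图：演示上表中的附加属性\u002F函数（示意用）\nimport gym\nimport numpy as np\n\n\nclass MaskedToyEnv(gym.Env):\n    # 带无效动作掩码的玩具环境（类名为虚构示例）\n    def __init__(self, **kwargs):\n        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,))\n        self.action_space = gym.spaces.Discrete(5)\n\n    def get_number_of_agents(self):\n        return 1  # 单智能体环境\n\n    def has_action_mask(self):\n        return True  # 告知 rl_games 本环境提供动作掩码\n\n    def get_action_mask(self):\n        mask = np.ones(5, dtype=np.uint8)\n        mask[-1] = 0  # 示例：最后一个动作当前不可用\n        return mask\n\n    def reset(self):\n        return np.zeros(4, dtype=np.float32)\n\n    def step(self, action):\n        obs = np.zeros(4, dtype=np.float32)\n        reward, done = 0.0, False\n        # 若启用 value_bootstrap，infos 中应始终包含 time_outs 标记\n        return obs, reward, done, {'time_outs': False}\n```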
\n\n## 发布说明\n\n1.6.5\n\n* 添加了支持可配置模式的 `torch.compile`（PyTorch 编译），可带来 10-40% 的性能提升，需要 torch 2.2 或更高版本。\n  * 默认模式为 `reduce-overhead`，以平衡编译时间和运行时性能\n  * 可通过 yaml 配置中的 `torch_compile` 参数设置（true\u002Ffalse\u002F\"default\"\u002F\"reduce-overhead\"\u002F\"max-autotune\"，写法示意见本节末尾）\n  * Actor 网络和中央价值网络可使用各自独立的编译模式\n  * 详细配置与模式选择指南参见 [torch.compile 文档](docs\u002FTORCH_COMPILE.md)\n* 修复了非对称 actor-critic（central_value）训练中的关键错误：\n  * 修复了 `update_lr()` 方法中错误的设备引用\n  * 修复了遍历数据集时的无限循环问题\n  * 为 `PPODataset` 类添加了正确的 `__iter__` 方法\n* 修复了 `RunningMeanStd` 中的方差计算，改用总体方差\n* 修复了 `get_mean_std_with_masks` 函数\n* 修复了检查点保存\u002F加载时缺失的中央价值优化器状态\n* 添加了 myosuite 支持\n* 添加了辅助损失（auxiliary loss）支持\n* Tacsl 更新：CNN（卷积神经网络）塔处理、critic 权重加载与冻结\n* 修复了 SAC（Soft Actor-Critic）的输入归一化\n* 修复了 SAC agent 的 summary writer，使其使用配置中的目录而非硬编码的 'runs\u002F'\n* 修复了默认 player 配置中的 num_games 取值\n* 修复了按环境应用 minibatch 大小（minibatch_size_per_env）的问题\n* 为 RNN（循环神经网络）添加了 concat_output 支持\n* SAC 改进：\n  * 修复了缺失的 `gamma_tensor` 初始化错误\n  * 移除了硬编码的 `torch.compile` 装饰器（现在遵循 yaml 配置）\n  * 优化了张量操作并移除了不必要的克隆\n* 环境包装器修复：\n  * 修复了元组\u002F列表观测值的处理，以兼容各类 gym 环境\n  * 在 `cast_obs` 中添加了正确的 numpy 到 torch 张量转换\n  * 修复了 envpool 包装器中缺失的 gym 导入\n* Ray（分布式计算框架）集成改进：\n  * 将 Ray 改为延迟导入（仅在使用 `RayVecEnv` 时加载）\n  * 添加了带 `ray_config` 参数的可配置 Ray 初始化\n  * 为 Ray actor 添加了带 `close()` 方法的正确清理\n  * 默认对象存储内存分配为 1GB
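\n\n上面提到的 `torch_compile` 参数写法示意如下（草图：假设该参数位于 params.config 块中，具体位置与取值以 docs\u002FTORCH_COMPILE.md 为准）：\n\n```python\n# 草图：在训练配置中启用 torch.compile（需要 torch >= 2.2）\ntrain_config = {\n    'params': {\n        'config': {\n            # 可选取值：True、False、'default'、'reduce-overhead'、'max-autotune'\n            'torch_compile': 'reduce-overhead',\n        },\n    },\n}\n```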
\n\n1.6.1\n\n* 修复了在训练多智能体（multi-agent）环境时发生的 Central Value RNN 错误。\n* 添加了 Deepmind Control PPO 基准测试。\n* 添加了几种训练价值预测的实验性方法（OneHot、TwoHot 编码，以及用交叉熵损失代替 L2）。\n* 新方法尚未启用，暂时无法从 yaml 文件开启；待找到训练效果更好的环境后会将其加入配置。\n* 将 shaped reward 曲线图添加到 TensorBoard（可视化工具）。\n* 修复了 SAC 不按 save_frequency 保存权重的错误。\n* 为 Isaac Gym（NVIDIA 物理仿真平台）等 GPU 加速训练环境添加了多节点训练支持，无需更改训练脚本。感谢 @ankurhanda 和 @ArthurAllshire 在实现上的协助。\n* 添加了训练期间的推理评估功能。启用后，训练过程中产生的检查点可被推理进程自动拾取并更新。\n* 添加了用于运行时更新 RL 训练参数的 get\u002Fset API。感谢 @ArthurAllshire 提供了快速 PBT 代码的初始版本。\n* 修复了 SAC 无法正确加载权重的问题。\n* 在不需要 Ray 的用例中移除了对它的依赖。\n* 添加了警告：在使用 RNN 网络的配置中使用已弃用的 'seq_len'（而非 'seq_length'）时会提示。\n\n\n1.6.0\n\n* 添加了离散和连续动作空间的 ONNX（开放神经网络交换）导出 Colab（云端笔记本）示例。连续情况还提供了 LSTM（长短期记忆网络）策略示例。\n* 改进了连续空间中的 RNN 训练，添加了 `zero_rnn_on_done` 选项。\n* 添加了 NVIDIA CuLE 支持：https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fcule\n* 添加了 player 配置覆盖（override）。推理时使用 vecenv。\n* 修复了带 central value 的多 GPU 训练。\n* 修复了 max_frames 终止条件及其与线性学习率调度的交互：https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games\u002Fissues\u002F212\n* 修复了 \"deterministic\" 拼写错误问题。\n* 修复了 Mujoco（物理引擎）和 Brax 的 SAC 配置。\n* 修复了多智能体环境的统计报告。修复了 Starcraft2 SMAC 环境。\n\n1.5.2\n\n* 为 SAC 添加了观测值归一化。\n* 恢复了旧的自适应 KL 模式。\n\n1.5.1\n\n* 修复了包构建问题。\n\n1.5.0\n\n* 添加了 wandb（Weights & Biases 实验跟踪工具）支持。\n* 添加了 poetry（Python 包管理工具）支持。\n* 修复了各种错误。\n* 修复了字典类型观测值情况下 CNN 输入未除以 255 的问题。\n* 添加了更多 envpool（环境池）的 mujoco 和 atari 训练示例。部分结果：15 分钟完成 Mujoco 人形机器人训练，2 分钟完成 atari pong。\n* 添加了 Brax 和 Mujoco 的 Colab 训练示例。\n* 添加了 'seed' 命令行参数。如果大于 0，将覆盖配置中的 seed。\n* 弃用 `horovod`（分布式训练框架），转而使用 `torch.distributed`（PyTorch 分布式后端）([#171](https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games\u002Fpull\u002F171))。\n\n1.4.0\n\n* 添加了 Discord 频道 https:\u002F\u002Fdiscord.gg\u002FhnYRq7DsQh :)\n* 添加了 envpool 支持及若干 atari 示例，比 ray 快 3-4 倍。\n* 添加了 mujoco 结果，远好于 openai spinning up 的 ppo 结果。\n* 添加了 tcnn（Tiny CUDA Neural Networks，https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn）支持，可减少 IsaacGym 环境中 5-10% 的训练时间。\n* 各种修复和改进。\n\n1.3.2\n\n* 添加了 'sigma' 命令行参数。如果 fixed_sigma 为 True，将覆盖连续空间的 sigma。\n\n1.3.1\n\n* 修复了 SAC 无法工作的问题。\n\n1.3.0\n\n* 简化了 RNN 实现。运行稍慢但更稳定。\n* 现在当策略是 RNN 时，Central Value 可以是非 RNN。\n* 从 yaml 文件中移除了 load_checkpoint。现在 --checkpoint 参数对训练和游玩均有效。\n\n1.2.0\n\n* 添加了 Swish (SiLU) 和 GELU（高斯误差线性单元）激活函数，可改善部分 Isaac Gym 环境的结果。\n* 移除了 tensorflow（深度学习框架）并对旧的\u002F未使用的代码进行了初步清理。\n* 简化了 runner。\n* 现在网络通过 algos 中的 load_network 方法创建。\n\n1.1.4\n\n* 修复了 simulation 与 rl_device 不同时 player 在 play（测试）模式下的崩溃问题。\n* 修复了各种多 GPU 错误。\n\n1.1.3\n\n* 修复了在 play（测试）模式下运行单个 Isaac Gym 环境时的崩溃问题。\n* 添加了配置参数 `clip_actions`，用于关闭内部的动作裁剪和重缩放。\n\n1.1.0\n\n* 已发布到 PyPI：`pip install rl-games`\n* 添加了环境（模拟）步数 FPS 报告（不含策略推理耗时），并改进了命名。\n* 为提高可读性重命名了 yaml 配置项：steps_num 改为 horizon_length，lr_threshold 改为 kl_threshold。\n\n\n\n## 故障排除\n\n* 部分受支持的环境不会通过 setup.py 安装，需要手动安装。\n* 从 rl-games 1.1.0 开始，旧的 yaml 配置与新版本不兼容：\n    * `steps_num` 应改为 `horizon_length`，`lr_threshold` 应改为 `kl_threshold`\n\n## 已知问题\n\n* 使用 Isaac Gym 运行单个环境可能导致崩溃；如果发生这种情况，请切换到至少并行模拟 2 个环境","# rl_games 快速上手指南\n\n**rl_games** 是一个基于 PyTorch 的高性能强化学习库，支持端到端 GPU 加速训练，适用于机器人仿真（如 Isaac Gym）、游戏（如 Atari）及多智能体场景。\n\n## 环境准备\n\n- **Python**: 建议使用 Python 3.8 及以上版本。\n- **CUDA**: 为获得最佳训练性能，推荐使用 PyTorch 2.2+ 配合 CUDA 12.1+。\n- **NumPy 重要提示**: 若计划使用 `EnvPool` 进行高性能模拟，**必须**使用 NumPy 1.x 版本。NumPy 2.0+ 与 EnvPool 不兼容，会导致训练失败。\n\n## 安装步骤\n\n### 1. 安装核心依赖\n\n首先安装 PyTorch 及相关组件：\n\n```bash\npip3 install torch torchvision\n```\n\n然后安装 rl_games 主包：\n\n```bash\npip install rl-games\n```\n\n如需从源码安装最新版本（先克隆仓库）：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games.git\ncd rl_games\npip install -e .\n```\n\n### 2. 安装运行环境依赖\n\n根据目标环境安装相应的依赖库：\n\n- **通用 CPU 环境**（需 EnvPool 或 Ray）:\n  ```bash\n  pip install envpool\n  # 或\n  pip install ray\n  ```\n\n- **Mujoco \u002F Atari \u002F Box2d 环境**:\n  ```bash\n  pip install \"gym[mujoco]\"\n  pip install \"gym[atari]\"\n  pip install \"gym[box2d]\"\n  ```\n\n- **Atari 额外依赖**:\n  ```bash\n  pip install opencv-python\n  # 现代 Gymnasium\u002FALE Atari 环境\n  pip install ale-py\n  ```\n\n- **EnvPool + NumPy 兼容性修复**（若使用 EnvPool）:\n  如果安装了 EnvPool 且遇到 NumPy 2.0 冲突，请降级 NumPy：\n  ```bash\n  pip uninstall numpy\n  pip install numpy==1.26.4\n  ```\n\n## 基本使用\n\n以下以 **Atari Pong** 环境为例，展示如何使用 `runner.py` 进行训练和测试。请确保已安装 `gym[atari]` 和 `envpool`。\n\n### 训练模型\n\n```bash\npython runner.py --train --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_pong_envpool.yaml\n```\n\n### 测试模型\n\n```bash\npython runner.py --play --file rl_games\u002Fconfigs\u002Fatari\u002Fppo_pong_envpool.yaml --checkpoint nn\u002FPong-v5_envpool.pth\n```
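\n\n除命令行外，也可以在 Python 中加载同一份 yaml 配置并启动训练（示意草图：基于 `rl_games.torch_runner.Runner` 的公开用法，配置路径与上面的命令行示例一致）：\n\n```python\n# 编程方式启动训练的草图：大致等价于 runner.py --train 加上该配置文件\nimport yaml\n\nfrom rl_games.torch_runner import Runner\n\nwith open('rl_games\u002Fconfigs\u002Fatari\u002Fppo_pong_envpool.yaml') as f:\n    config = yaml.safe_load(f)\n\nrunner = Runner()\nrunner.load(config)                         # 读取 params 配置块\nrunner.run({'train': True, 'play': False})  # 开始训练\n```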
\n\n### 其他环境示例\n\n- **NVIDIA Isaac Gym (Ant)**（需先安装 Isaac Gym 及 IsaacGymEnvs，并在其仓库目录中运行）:\n  ```bash\n  python train.py task=Ant headless=True\n  python train.py task=Ant test=True checkpoint=nn\u002FAnt.pth num_envs=100\n  ```\n\n- **Brax (Ant)**:\n  ```bash\n  pip install -U \"jax[cuda12]\"\n  pip install brax\n  python runner.py --train --file rl_games\u002Fconfigs\u002Fbrax\u002Fppo_ant.yaml\n  python runner.py --play --file rl_games\u002Fconfigs\u002Fbrax\u002Fppo_ant.yaml --checkpoint runs\u002FAnt_brax\u002Fnn\u002FAnt_brax.pth\n  ```","某机器人初创公司的算法团队正在开发灵巧手操作任务，需要在仿真环境中快速训练策略并迁移到真机。\n\n### 没有 rl_games 时\n- 需要从零编写强化学习算法代码，调试 PPO 等策略耗时耗力，容易引入 Bug。\n- CPU 并行环境数量受限，单卡训练效率低，模型收敛慢，迭代周期长达数周。\n- 仿真环境与算法框架耦合紧密，更换物理引擎（如从 Mujoco 换到 Isaac Gym）需重构大量代码。\n- 缺乏多智能体协作支持，难以模拟灵巧手指间复杂的协同控制动作。\n\n### 使用 rl_games 后\n- rl_games 内置成熟的 PPO 实现，直接调用即可开始训练，节省大量基础编码时间。\n- 利用 GPU 加速和 EnvPool 技术，单卡可并行运行数千个环境，训练速度提升数十倍。\n- 原生支持 Isaac Gym 和 Brax，无缝切换不同仿真后端，无需修改核心逻辑。\n- 提供多智能体训练接口，轻松实现灵巧手指间的协同控制策略，加速 Sim-to-Real 迁移。\n\nrl_games 通过高性能 GPU 并行与成熟算法库，将机器人策略训练周期从数周缩短至数天。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDenys88_rl_games_b03974ed.gif","Denys88","Denys Makoviichuk","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FDenys88_dc586c0a.png",null,"Los Angeles","DenysM88","https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fdenys-makoviichuk-2219a72b","https:\u002F\u002Fgithub.com\u002FDenys88",[84,88],{"name":85,"color":86,"percentage":87},"Jupyter Notebook","#DA5B0B",98.2,{"name":89,"color":90,"percentage":91},"Python","#3572A5",1.8,1318,205,"2026-04-05T08:15:14","MIT","未说明","推荐 NVIDIA GPU，CUDA 12.1+ (最佳性能)",{"notes":99,"python":96,"dependencies":100},"1. 使用 EnvPool 时 NumPy 必须为 1.x 版本 (如 1.26.4)，与 NumPy 2.0+ 不兼容。2. 若要运行 Isaac Gym 任务，需单独下载并安装 Isaac Gym 及 IsaacGymEnvs。3. 支持多 GPU 训练 (通过 torchrun 编排)。4. 实验追踪集成 Weights and Biases。5. 
推荐使用 poetry 进行开发环境配置。",[101,102,103,104,105,106,107,108,109,110],"torch>=2.2","torchvision","gym","envpool","ray","jax[cuda12]","brax","opencv-python","ale-py","numpy",[13],[113,114,115],"deep-learning","pytorch","reinforcement-learning","2026-03-27T02:49:30.150509","2026-04-06T06:46:20.432809",[119,124,129,134,139,144],{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},3179,"如何从 Omni Isaac Gym 导出策略到 ONNX 并验证一致性？","使用 rlgames_export.py 脚本，恢复默认 checkpoint 作为权重。验证时需注意 Isaac Sim 的输出通常被 clip 到 [-1, 1]，而 ONNX 模型输出未 clip。可通过对比输入数据和 Agent 输出的 Tensor 数值来确认导出成功（例如输入 float32 数组，对比 Actions 输出）。","https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games\u002Fissues\u002F226",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},3180,"PPO 算法中的 Value Normalization 是否必须？对 SAC 有效吗？","对于 PPO，Value Normalization 非常重要，禁用后在某些任务（如 ShadowHand）上会导致失败，因为它用于计算 Loss 时的归一化。SAC 也可以在没有 running mean std 的情况下工作，但需要调整超参数。MAPPO 论文中也提及了该技巧。","https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games\u002Fissues\u002F182",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},3181,"如何在多 GPU 环境下同时进行仿真和训练？","已测试单节点最多 8 张 GPU，每个 Isaac Gym 实例运行在独立的 GPU 上。Horovod 可能默认只使用一个设备，建议使用较新版本（如 v1.1.4）以修复多 GPU 相关的变量初始化等问题。","https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games\u002Fissues\u002F95",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},3182,"如何在 rl_games 外部加载.pth 检查点进行推理？","方法一：修改 rl_games 代码，每次保存 checkpoint 时自动导出 ONNX 模型。方法二：使用默认 Notebook，将 \"train\": True 改为 false，添加 \"run\": True 并指定你的 checkpoint 路径。最快的方式可能是微调 Omniverse 代码以支持直接导出。","https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games\u002Fissues\u002F243",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},3183,"如何将模型导出为 C++ 兼容的 TorchScript 模块？","目前直接通过 check_trace 检查可能失败。可以尝试使用 check_trace=False 参数进行导出，已有用户反馈此方法可行。","https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games\u002Fissues\u002F92",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},3184,"如何将 Isaac Gym 训练的模型应用到真实机器人硬件？","由于复现网络架构较为复杂，维护者建议通过邮件联系获取专用仓库链接。请在维护者个人主页查找邮箱地址并发送请求，通常会收到包含具体实现方案的回复。","https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games\u002Fissues\u002F266",[150,155,159,164,168],{"id":151,"version":152,"summary_zh":153,"released_at":154},102683,"v1.6.5","## rl_games v1.6.5\n\n### Changes\n- Migrated build system to pyproject.toml (poetry-core)\n- Removed setup.py\n- Removed opencv-python from core dependencies\n- Made wandb an optional extra (`pip install rl-games[wandb]`)\n- Added watchdog as a core dependency\n- Updated torch version constraints for Python 3.9+\n- Added classifiers, license, and project URLs to package metadata\n\n### Install\n```\npip install rl-games==1.6.5\n```","2026-02-20T06:23:51",{"id":156,"version":157,"summary_zh":78,"released_at":158},102684,"v1.6.1","2023-10-06T19:09:41",{"id":160,"version":161,"summary_zh":162,"released_at":163},102685,"v1.6.0","- Added ONNX export colab example for discrete and continuous action spaces. For continuous case LSTM policy example is provided as well.\r\n- Improved RNNs training in continuous space, added option zero_rnn_on_done.\r\n- Added NVIDIA CuLE support: https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fcule\r\n- Added player config override. 
Vecenv is used for inference.\r\n- Fixed multi-gpu training with central value.\r\n- Fixed max_frames termination condition, and its interaction with the linear learning rate: https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games\u002Fissues\u002F212\r\n- Fixed \"deterministic\" misspelling issue.\r\n- Fixed Mujoco and Brax SAC configs.\r\n- Fixed multiagent envs statistics reporting. Fixed Starcraft2 SMAC environments.\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDenys88\u002Frl_games\u002Fcompare\u002Fv1.5.2...v1.6.0","2023-02-21T08:19:26",{"id":165,"version":166,"summary_zh":78,"released_at":167},102686,"v1.0-alpha2","2020-10-17T21:13:30",{"id":169,"version":170,"summary_zh":78,"released_at":171},102687,"v1.0-alpha","2020-10-17T21:11:02"]