[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-aws--sagemaker-training-toolkit":3,"tool-aws--sagemaker-training-toolkit":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",151918,2,"2026-04-12T11:33:05",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":77,"owner_email":78,"owner_twitter":77,"owner_website":79,"owner_url":80,"languages":81,"stars":98,"forks":99,"last_commit_at":100,"license":101,"difficulty_score":10,"env_os":102,"env_gpu":103,"env_ram":103,"env_deps":104,"category_tags":110,"github_topics":111,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":117,"updated_at":118,"faqs":119,"releases":147},6859,"aws\u002Fsagemaker-training-toolkit","sagemaker-training-toolkit","Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.","sagemaker-training-toolkit 是一款专为 Amazon SageMaker 设计的开源库，旨在帮助开发者轻松构建自定义的机器学习训练环境。它的核心功能是让任何标准的 Docker 容器都能无缝兼容 SageMaker 的训练服务，从而简化模型从开发到部署的流程。\n\n在机器学习工程中，确保训练环境的一致性往往是个挑战。sagemaker-training-toolkit 有效解决了这一问题：它允许用户将自己的训练脚本和依赖项打包进 Docker 
容器，并在其中自动处理入口点执行、超参数解析及环境变量配置等繁琐细节。这意味着用户无需重复编写底层适配代码，即可在隔离且稳定的环境中运行训练任务。\n\n这款工具非常适合需要灵活定制训练环境的机器学习工程师、数据科学家以及算法研究人员。如果你不满足于预置框架，希望使用特定的系统库或非标准依赖进行模型训练，sagemaker-training-toolkit 将是理想选择。其技术亮点在于极简的集成方式——只需在 Dockerfile 中安装该库，并将脚本置于指定目录，即可通过简单的环境变量（如 `SAGEMAKER_PROGRA","sagemaker-training-toolkit 是一款专为 Amazon SageMaker 设计的开源库，旨在帮助开发者轻松构建自定义的机器学习训练环境。它的核心功能是让任何标准的 Docker 容器都能无缝兼容 SageMaker 的训练服务，从而简化模型从开发到部署的流程。\n\n在机器学习工程中，确保训练环境的一致性往往是个挑战。sagemaker-training-toolkit 有效解决了这一问题：它允许用户将自己的训练脚本和依赖项打包进 Docker 容器，并在其中自动处理入口点执行、超参数解析及环境变量配置等繁琐细节。这意味着用户无需重复编写底层适配代码，即可在隔离且稳定的环境中运行训练任务。\n\n这款工具非常适合需要灵活定制训练环境的机器学习工程师、数据科学家以及算法研究人员。如果你不满足于预置框架，希望使用特定的系统库或非标准依赖进行模型训练，sagemaker-training-toolkit 将是理想选择。其技术亮点在于极简的集成方式——只需在 Dockerfile 中安装该库，并将脚本置于指定目录，即可通过简单的环境变量（如 `SAGEMAKER_PROGRAM`）定义训练入口，迅速将本地代码转化为可在云端大规模运行的训练作业。","![SageMaker](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Faws_sagemaker-training-toolkit_readme_b1d25cfd629c.png)\n\n# SageMaker Training Toolkit\n\n[![Latest Version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fsagemaker-training.svg)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fsagemaker-training) [![Supported Python Versions](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fsagemaker-training.svg)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fsagemaker-training) [![Code Style: Black](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcode_style-black-000000.svg)](https:\u002F\u002Fgithub.com\u002Fpython\u002Fblack)\n\nTrain machine learning models within a Docker container using Amazon SageMaker.\n\n\n## :books: Background\n\n[Amazon SageMaker](https:\u002F\u002Faws.amazon.com\u002Fsagemaker\u002F) is a fully managed service for data science and machine learning (ML) workflows.\nYou can use Amazon SageMaker to simplify the process of building, training, and deploying ML models.\n\nTo train a model, you can include your training script and dependencies in a [Docker 
container](https:\u002F\u002Fwww.docker.com\u002Fresources\u002Fwhat-container) that runs your training code.\nA container provides an effectively isolated environment, ensuring a consistent runtime and reliable training process. \n\nThe **SageMaker Training Toolkit** can be easily added to any Docker container, making it compatible with SageMaker for [training models](https:\u002F\u002Faws.amazon.com\u002Fsagemaker\u002Ftrain\u002F).\nIf you use a [prebuilt SageMaker Docker image for training](https:\u002F\u002Fdocs.aws.amazon.com\u002Fsagemaker\u002Flatest\u002Fdg\u002Fpre-built-containers-frameworks-deep-learning.html), this library may already be included.\n\nFor more information, see the Amazon SageMaker Developer Guide sections on [using Docker containers for training](https:\u002F\u002Fdocs.aws.amazon.com\u002Fsagemaker\u002Flatest\u002Fdg\u002Fyour-algorithms.html).\n\n## :hammer_and_wrench: Installation\n\nTo install this library in your Docker image, add the following line to your [Dockerfile](https:\u002F\u002Fdocs.docker.com\u002Fengine\u002Freference\u002Fbuilder\u002F):\n\n``` dockerfile\nRUN pip3 install sagemaker-training\n```\n\n## :computer: Usage\n\nThe following are brief how-to guides.\nFor complete, working examples of custom training containers built with the SageMaker Training Toolkit, please see [the example notebooks](https:\u002F\u002Fgithub.com\u002Fawslabs\u002Famazon-sagemaker-examples\u002Ftree\u002Fmaster\u002Fadvanced_functionality\u002Fcustom-training-containers).\n\n### Create a Docker image and train a model\n\n1. Write a training script (eg. `train.py`).\n\n2. 
[Define a container with a Dockerfile](https:\u002F\u002Fdocs.docker.com\u002Fget-started\u002Fpart2\u002F#define-a-container-with-dockerfile) that includes the training script and any dependencies.\n\n    The training script must be located in the `\u002Fopt\u002Fml\u002Fcode` directory.\n    The environment variable `SAGEMAKER_PROGRAM` defines which file inside the `\u002Fopt\u002Fml\u002Fcode` directory to use as the training entry point.\n    When training starts, the interpreter executes the entry point defined by `SAGEMAKER_PROGRAM`.\n    Python and shell scripts are both supported.\n    \n    ``` docker\n    FROM yourbaseimage:tag\n  \n    # install the SageMaker Training Toolkit \n    RUN pip3 install sagemaker-training\n\n    # copy the training script inside the container\n    COPY train.py \u002Fopt\u002Fml\u002Fcode\u002Ftrain.py\n\n    # define train.py as the script entry point\n    ENV SAGEMAKER_PROGRAM train.py\n    ```\n\n3. Build and tag the Docker image.\n\n    ``` shell\n    docker build -t custom-training-container .\n    ```\n\n4. 
Use the Docker image to start a training job using the [SageMaker Python SDK](https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-python-sdk).\n\n    ``` python\n    from sagemaker.estimator import Estimator\n\n    estimator = Estimator(image_name=\"custom-training-container\",\n                          role=\"SageMakerRole\",\n                          train_instance_count=1,\n                          train_instance_type=\"local\")\n\n    estimator.fit()\n    ```\n    \n    To train a model using the image on SageMaker, [push the image to ECR](https:\u002F\u002Fdocs.aws.amazon.com\u002FAmazonECR\u002Flatest\u002Fuserguide\u002Fdocker-push-ecr-image.html) and start a SageMaker training job with the image URI.\n    \n\n### Pass arguments to the entry point using hyperparameters\n\nAny hyperparameters provided by the training job are passed to the entry point as script arguments.\nThe SageMaker Python SDK uses this feature to pass special hyperparameters to the training job, including `sagemaker_program` and `sagemaker_submit_directory`.\nThe complete list of SageMaker hyperparameters is available [here](https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-training-toolkit\u002Fblob\u002Fmaster\u002Fsrc\u002Fsagemaker_training\u002Fparams.py).\n\n1. Implement an argument parser in the entry point script. For example, in a Python script:\n\n    ``` python\n    import argparse\n\n    if __name__ == \"__main__\":\n      parser = argparse.ArgumentParser()\n\n      parser.add_argument(\"--learning-rate\", type=int, default=1)\n      parser.add_argument(\"--batch-size\", type=int, default=64)\n      parser.add_argument(\"--communicator\", type=str)\n      parser.add_argument(\"--frequency\", type=int, default=20)\n\n      args = parser.parse_args()\n      ...\n    ```\n\n2. 
Start a training job with hyperparameters.\n\n    ``` python\n    {\"HyperParameters\": {\"batch-size\": 256, \"learning-rate\": 0.0001, \"communicator\": \"pure_nccl\"}}\n    ```\n\n### Read additional information using environment variables\n\nAn entry point often needs additional information not available in `hyperparameters`.\nThe SageMaker Training Toolkit writes this information as environment variables that are available from within the script.\nFor example, this training job includes the channels `training` and `testing`:\n\n``` python\nfrom sagemaker.pytorch import PyTorch\n\nestimator = PyTorch(entry_point=\"train.py\", ...)\n\nestimator.fit({\"training\": \"s3:\u002F\u002Fbucket\u002Fpath\u002Fto\u002Ftraining\u002Fdata\", \n               \"testing\": \"s3:\u002F\u002Fbucket\u002Fpath\u002Fto\u002Ftesting\u002Fdata\"})\n```\n\nThe environment variables `SM_CHANNEL_TRAINING` and `SM_CHANNEL_TESTING` provide the paths to the channels:\n\n``` python\nimport argparse\nimport os\n\nif __name__ == \"__main__\":\n  parser = argparse.ArgumentParser()\n\n  ...\n\n  # reads input channels training and testing from the environment variables\n  parser.add_argument(\"--training\", type=str, default=os.environ[\"SM_CHANNEL_TRAINING\"])\n  parser.add_argument(\"--testing\", type=str, default=os.environ[\"SM_CHANNEL_TESTING\"])\n\n  args = parser.parse_args()\n\n  ...\n```\n\nWhen training starts, SageMaker Training Toolkit will print all available environment variables. 
Please see the [reference on environment variables](https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-training-toolkit\u002Fblob\u002Fmaster\u002FENVIRONMENT_VARIABLES.md) for a full list of provided environment variables.\n\n### Get information about the container environment\n\nTo get information about the container environment, initialize an `Environment` object.\n`Environment` provides access to aspects of the environment relevant to training jobs, including hyperparameters, system characteristics, filesystem locations, environment variables and configuration settings.\nIt is a read-only snapshot of the container environment during training, and it doesn't contain any form of state.\n\n``` python\nfrom sagemaker_training import environment\n\nenv = environment.Environment()\n\n# get the path of the channel \"training\" from the `inputdataconfig.json` file\ntraining_dir = env.channel_input_dirs[\"training\"]\n\n# get the hyperparameter \"training_data_file\" from the `hyperparameters.json` file\nfile_name = env.hyperparameters[\"training_data_file\"]\n\n# get the folder where the model should be saved\nmodel_dir = env.model_dir\n\n# train the model\ndata = np.load(os.path.join(training_dir, file_name))\nx_train, y_train = data[\"features\"], keras.utils.to_categorical(data[\"labels\"])\nmodel = ResNet50(weights=\"imagenet\")\n...\nmodel.fit(x_train, y_train)\n\n# save the model to the model_dir at the end of training\nmodel.save(os.path.join(model_dir, \"saved_model\"))\n```\n\n### Execute the entry point\n\nTo execute the entry point, call `entry_point.run()`.\n\n``` python\nfrom sagemaker_training import entry_point, environment\n\nenv = environment.Environment()\n\n# read hyperparameters as script arguments\nargs = env.to_cmd_args()\n\n# get the environment variables\nenv_vars = env.to_env_vars()\n\n# execute the entry point\nentry_point.run(uri=env.module_dir,\n                user_entry_point=env.user_entry_point,\n                args=args,\n               
 env_vars=env_vars)\n\n```\n\nIf the entry point execution fails, `trainer.train()` will write the error message to `\u002Fopt\u002Fml\u002Foutput\u002Ffailure`. Otherwise, it will write to the file `\u002Fopt\u002Fml\u002Fsuccess`.\n\n## :scroll: License\n\nThis library is licensed under the [Apache 2.0 License](http:\u002F\u002Faws.amazon.com\u002Fapache2.0\u002F).\nFor more details, please take a look at the [LICENSE](https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-training-toolkit\u002Fblob\u002Fmaster\u002FLICENSE) file.\n\n## :handshake: Contributing\n\nContributions are welcome!\nPlease read our [contributing guidelines](https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-training-toolkit\u002Fblob\u002Fmaster\u002FCONTRIBUTING.md)\nif you'd like to open an issue or submit a pull request.\n","![SageMaker](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Faws_sagemaker-training-toolkit_readme_b1d25cfd629c.png)\n\n# SageMaker 训练工具包\n\n[![最新版本](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fsagemaker-training.svg)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fsagemaker-training) [![支持的 Python 版本](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fsagemaker-training.svg)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fsagemaker-training) [![代码风格：Black](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcode_style-black-000000.svg)](https:\u002F\u002Fgithub.com\u002Fpython\u002Fblack)\n\n使用 Amazon SageMaker 在 Docker 容器中训练机器学习模型。\n\n\n## :books: 背景\n\n[Amazon SageMaker](https:\u002F\u002Faws.amazon.com\u002Fsagemaker\u002F) 是一项完全托管的数据科学和机器学习 (ML) 工作流服务。\n您可以使用 Amazon SageMaker 来简化构建、训练和部署 ML 模型的过程。\n\n要训练模型，您可以将训练脚本和依赖项包含在一个运行您的训练代码的 [Docker 容器](https:\u002F\u002Fwww.docker.com\u002Fresources\u002Fwhat-container) 中。\n容器提供了一个有效的隔离环境，确保一致的运行时和可靠的训练过程。\n\n**SageMaker 训练工具包** 可以轻松添加到任何 Docker 容器中，使其与 SageMaker 兼容，用于 [训练模型](https:\u002F\u002Faws.amazon.com\u002Fsagemaker\u002Ftrain\u002F)。\n如果您使用 [SageMaker 预建训练用 Docker 
镜像](https:\u002F\u002Fdocs.aws.amazon.com\u002Fsagemaker\u002Flatest\u002Fdg\u002Fpre-built-containers-frameworks-deep-learning.html)，则此库可能已包含在内。\n\n有关更多信息，请参阅 Amazon SageMaker 开发人员指南中关于 [使用 Docker 容器进行训练](https:\u002F\u002Fdocs.aws.amazon.com\u002Fsagemaker\u002Flatest\u002Fdg\u002Fyour-algorithms.html) 的部分。\n\n## :hammer_and_wrench: 安装\n\n要在您的 Docker 镜像中安装此库，请将以下行添加到您的 [Dockerfile](https:\u002F\u002Fdocs.docker.com\u002Fengine\u002Freference\u002Fbuilder\u002F)：\n\n``` dockerfile\nRUN pip3 install sagemaker-training\n```\n\n## :computer: 使用\n\n以下是简要的操作指南。\n有关使用 SageMaker 训练工具包构建的自定义训练容器的完整可运行示例，请参阅 [示例笔记本](https:\u002F\u002Fgithub.com\u002Fawslabs\u002Famazon-sagemaker-examples\u002Ftree\u002Fmaster\u002Fadvanced_functionality\u002Fcustom-training-containers)。\n\n### 创建 Docker 镜像并训练模型\n\n1. 编写训练脚本（例如 `train.py`）。\n\n2. [使用 Dockerfile 定义一个包含训练脚本及所有依赖项的容器](https:\u002F\u002Fdocs.docker.com\u002Fget-started\u002Fpart2\u002F#define-a-container-with-dockerfile)。\n\n    训练脚本必须位于 `\u002Fopt\u002Fml\u002Fcode` 目录中。\n    环境变量 `SAGEMAKER_PROGRAM` 定义了 `\u002Fopt\u002Fml\u002Fcode` 目录中哪个文件作为训练入口点。\n    当训练开始时，解释器会执行由 `SAGEMAKER_PROGRAM` 定义的入口点。\n    Python 和 Shell 脚本均受支持。\n    \n    ``` docker\n    FROM yourbaseimage:tag\n  \n    # 安装 SageMaker 训练工具包\n    RUN pip3 install sagemaker-training\n\n    # 将训练脚本复制到容器内\n    COPY train.py \u002Fopt\u002Fml\u002Fcode\u002Ftrain.py\n\n    # 将 train.py 定义为脚本入口点\n    ENV SAGEMAKER_PROGRAM train.py\n    ```\n\n3. 构建并标记 Docker 镜像。\n\n    ``` shell\n    docker build -t custom-training-container .\n    ```\n\n4. 
使用该 Docker 镜像通过 [SageMaker Python SDK](https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-python-sdk) 启动训练任务。\n\n    ``` python\n    from sagemaker.estimator import Estimator\n\n    estimator = Estimator(image_name=\"custom-training-container\",\n                          role=\"SageMakerRole\",\n                          train_instance_count=1,\n                          train_instance_type=\"local\")\n\n    estimator.fit()\n    ```\n    \n    要在 SageMaker 上使用该镜像训练模型，需先将镜像 [推送到 ECR](https:\u002F\u002Fdocs.aws.amazon.com\u002FAmazonECR\u002Flatest\u002Fuserguide\u002Fdocker-push-ecr-image.html)，然后使用镜像 URI 启动 SageMaker 训练作业。\n\n\n### 使用超参数向入口点传递参数\n\n训练任务提供的任何超参数都会作为脚本参数传递给入口点。\nSageMaker Python SDK 利用此功能将特殊超参数传递给训练任务，包括 `sagemaker_program` 和 `sagemaker_submit_directory`。\n完整的 SageMaker 超参数列表可在 [此处](https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-training-toolkit\u002Fblob\u002Fmaster\u002Fsrc\u002Fsagemaker_training\u002Fparams.py) 查看。\n\n1. 在入口点脚本中实现参数解析器。例如，在 Python 脚本中：\n\n    ``` python\n    import argparse\n\n    if __name__ == \"__main__\":\n      parser = argparse.ArgumentParser()\n\n      parser.add_argument(\"--learning-rate\", type=float, default=1.0)\n      parser.add_argument(\"--batch-size\", type=int, default=64)\n      parser.add_argument(\"--communicator\", type=str)\n      parser.add_argument(\"--frequency\", type=int, default=20)\n\n      args = parser.parse_args()\n      ...\n    ```\n\n2. 
带超参数启动训练任务。\n\n    ``` python\n    {\"HyperParameters\": {\"batch-size\": 256, \"learning-rate\": 0.0001, \"communicator\": \"pure_nccl\"}}\n    ```\n\n### 使用环境变量读取附加信息\n\n入口点通常需要一些无法从 `hyperparameters` 中获取的额外信息。\nSageMaker 训练工具包会将这些信息写入环境变量，供脚本内部使用。\n例如，此训练任务包含 `training` 和 `testing` 两个数据通道：\n\n``` python\nfrom sagemaker.pytorch import PyTorch\n\nestimator = PyTorch(entry_point=\"train.py\", ...)\n\nestimator.fit({\"training\": \"s3:\u002F\u002Fbucket\u002Fpath\u002Fto\u002Ftraining\u002Fdata\", \n               \"testing\": \"s3:\u002F\u002Fbucket\u002Fpath\u002Fto\u002Ftesting\u002Fdata\"})\n```\n\n环境变量 `SM_CHANNEL_TRAINING` 和 `SM_CHANNEL_TESTING` 提供了这些通道的路径：\n\n``` python\nimport argparse\nimport os\n\nif __name__ == \"__main__\":\n  parser = argparse.ArgumentParser()\n\n  ...\n\n  # 从环境变量中读取输入通道 training 和 testing\n  parser.add_argument(\"--training\", type=str, default=os.environ[\"SM_CHANNEL_TRAINING\"])\n  parser.add_argument(\"--testing\", type=str, default=os.environ[\"SM_CHANNEL_TESTING\"])\n\n  args = parser.parse_args()\n\n  ...\n```\n\n训练开始时，SageMaker 训练工具包会打印所有可用的环境变量。有关提供的完整环境变量列表，请参阅 [环境变量参考文档](https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-training-toolkit\u002Fblob\u002Fmaster\u002FENVIRONMENT_VARIABLES.md)。\n\n### 获取容器环境信息\n\n要获取容器环境信息，可以初始化一个 `Environment` 对象。`Environment` 提供了对训练作业相关环境方面的访问权限，包括超参数、系统特性、文件系统路径、环境变量和配置设置等。它是在训练期间对容器环境的只读快照，并不包含任何状态信息。\n\n```python\nfrom sagemaker_training import environment\n\nenv = environment.Environment()\n\n# 从 `inputdataconfig.json` 文件中获取通道 \"training\" 的路径\ntraining_dir = env.channel_input_dirs[\"training\"]\n\n# 从 `hyperparameters.json` 文件中获取超参数 \"training_data_file\"\nfile_name = env.hyperparameters[\"training_data_file\"]\n\n# 获取模型应保存的文件夹\nmodel_dir = env.model_dir\n\n# 训练模型\ndata = np.load(os.path.join(training_dir, file_name))\nx_train, y_train = data[\"features\"], keras.utils.to_categorical(data[\"labels\"])\nmodel = ResNet50(weights=\"imagenet\")\n...\nmodel.fit(x_train, y_train)\n\n# 
在训练结束时将模型保存到 model_dir\nmodel.save(os.path.join(model_dir, \"saved_model\"))\n```\n\n### 执行入口点\n\n要执行入口点，调用 `entry_point.run()` 即可。\n\n```python\nfrom sagemaker_training import entry_point, environment\n\nenv = environment.Environment()\n\n# 将超参数作为脚本参数读取\nargs = env.to_cmd_args()\n\n# 获取环境变量\nenv_vars = env.to_env_vars()\n\n# 执行入口点\nentry_point.run(uri=env.module_dir,\n                user_entry_point=env.user_entry_point,\n                args=args,\n                env_vars=env_vars)\n```\n\n如果入口点执行失败，`trainer.train()` 会将错误信息写入 `\u002Fopt\u002Fml\u002Foutput\u002Ffailure`。否则，它会将成功信息写入 `\u002Fopt\u002Fml\u002Fsuccess` 文件。\n\n## :scroll: 许可证\n\n本库采用 [Apache 2.0 许可证](http:\u002F\u002Faws.amazon.com\u002Fapache2.0\u002F) 授权。更多详情请参阅 [LICENSE](https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-training-toolkit\u002Fblob\u002Fmaster\u002FLICENSE) 文件。\n\n## :handshake: 贡献\n\n欢迎贡献！如果您想提交问题或拉取请求，请阅读我们的 [贡献指南](https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-training-toolkit\u002Fblob\u002Fmaster\u002FCONTRIBUTING.md)。","# SageMaker Training Toolkit 快速上手指南\n\nSageMaker Training Toolkit 是一个用于在 Docker 容器内运行 Amazon SageMaker 训练任务的工具库。它允许你将自定义的训练脚本和依赖打包进容器，并使其兼容 SageMaker 的训练环境。\n\n## 环境准备\n\n*   **系统要求**：支持 Docker 的操作系统（Linux, macOS, Windows with WSL2）。\n*   **前置依赖**：\n    *   Python 3.6 或更高版本。\n    *   Docker Engine 已安装并运行。\n    *   AWS CLI 已配置（用于推送镜像到 ECR）。\n    *   `sagemaker` Python SDK（用于本地测试或提交训练任务）。\n\n## 安装步骤\n\n要将此库集成到你的自定义训练容器中，请在你的 `Dockerfile` 中添加以下安装命令：\n\n```dockerfile\nRUN pip3 install sagemaker-training\n```\n\n> **提示**：如果你使用的是 AWS 提供的预构建 SageMaker 深度学习镜像，该库通常已预先安装，无需重复添加。\n\n## 基本使用\n\n以下是构建自定义训练容器并运行模型的最简流程：\n\n### 1. 编写训练脚本\n创建一个 Python 脚本（例如 `train.py`），作为训练的入口点。\n\n```python\n# train.py\nif __name__ == \"__main__\":\n    print(\"Training started...\")\n    # 在此处添加你的模型训练逻辑\n    # 可以通过 os.environ 读取 SageMaker 注入的环境变量\n    # 可以通过 argparse 读取超参数\n```\n\n### 2. 
创建 Dockerfile\n编写 `Dockerfile`，将基础镜像、工具库和训练脚本打包在一起。**注意**：脚本必须位于 `\u002Fopt\u002Fml\u002Fcode` 目录，并通过 `SAGEMAKER_PROGRAM` 环境变量指定入口文件。\n\n```dockerfile\nFROM python:3.8-slim\n\n# 安装 SageMaker Training Toolkit\nRUN pip3 install sagemaker-training\n\n# 将训练脚本复制到容器内的指定目录\nCOPY train.py \u002Fopt\u002Fml\u002Fcode\u002Ftrain.py\n\n# 定义入口点脚本\nENV SAGEMAKER_PROGRAM train.py\n```\n\n### 3. 构建 Docker 镜像\n在包含 `Dockerfile` 和 `train.py` 的目录下执行构建命令：\n\n```shell\ndocker build -t custom-training-container .\n```\n\n### 4. 运行训练任务\n你可以使用 `sagemaker` Python SDK 在本地模式（`local`）下测试该镜像，或者将其推送到 Amazon ECR 后在云端运行。\n\n**本地测试示例：**\n\n```python\nfrom sagemaker.estimator import Estimator\n\nestimator = Estimator(image_name=\"custom-training-container\",\n                      role=\"SageMakerRole\",\n                      train_instance_count=1,\n                      train_instance_type=\"local\")\n\nestimator.fit()\n```\n\n**云端运行提示：**\n若要在 AWS SageMaker 上运行，请先将镜像推送到 Amazon ECR：\n```shell\ndocker tag custom-training-container \u003Cyour-account-id>.dkr.ecr.\u003Cregion>.amazonaws.com\u002Fcustom-training-container:latest\ndocker push \u003Cyour-account-id>.dkr.ecr.\u003Cregion>.amazonaws.com\u002Fcustom-training-container:latest\n```\n然后使用生成的 ECR 镜像 URI 替换上述 Python 代码中的 `image_name` 并将 `train_instance_type` 设置为实际的云实例类型（如 `ml.m5.large`）。","某金融科技团队需要将自定义的异常检测算法迁移至 Amazon SageMaker 平台，以利用其弹性算力进行大规模历史数据训练。\n\n### 没有 sagemaker-training-toolkit 时\n- 开发人员必须手动编写复杂的 Shell 入口脚本，用于解析 SageMaker 传递的超参数、输入数据路径及输出配置，极易因格式错误导致任务启动失败。\n- 容器内部缺乏标准化的环境变量映射机制，每次调整代码逻辑或依赖库时，都需要反复修改 Dockerfile 中的硬编码路径，维护成本极高。\n- 本地调试环境与云端生产环境不一致，开发者难以在本地 Docker 中模拟真实的 SageMaker 训练行为，导致“本地运行正常，上云即报错”的频繁返工。\n- 缺乏统一的训练入口定义标准，团队成员各自为政，使得代码复用性差，新成员上手定制容器门槛高。\n\n### 使用 sagemaker-training-toolkit 后\n- 只需在 Dockerfile 中安装该工具包并设置 `SAGEMAKER_PROGRAM` 环境变量，即可自动识别并执行指定的 Python 训练脚本，无需手写任何引导代码。\n- 工具自动处理超参数解析、通道（Channel）数据挂载及模型保存路径映射，开发者只需关注核心算法逻辑，彻底解耦了业务代码与基础设施细节。\n- 支持在本地 Docker 容器中完美复现 SageMaker 的训练运行时行为，实现了“一次编写，随处运行”，大幅缩短了从开发到部署的验证周期。\n- 
提供了标准化的容器构建规范，团队可快速基于此模板复制出多个不同算法的训练镜像，显著提升了协作效率和工程规范性。\n\nsagemaker-training-toolkit 通过将繁琐的基础设施适配工作自动化，让数据科学家能专注于模型创新而非容器运维。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Faws_sagemaker-training-toolkit_1b330b80.png","aws","Amazon Web Services","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Faws_84ebd8ed.png","",null,"open-source-github@amazon.com","https:\u002F\u002Famazon.com\u002Faws","https:\u002F\u002Fgithub.com\u002Faws",[82,86,90,94],{"name":83,"color":84,"percentage":85},"Python","#3572A5",96,{"name":87,"color":88,"percentage":89},"C","#555555",3.2,{"name":91,"color":92,"percentage":93},"Dockerfile","#384d54",0.7,{"name":95,"color":96,"percentage":97},"Shell","#89e051",0.1,535,140,"2026-04-04T09:48:09","Apache-2.0","Linux","未说明",{"notes":105,"python":106,"dependencies":107},"该工具主要用于在 Docker 容器内运行 Amazon SageMaker 训练任务。必须将训练脚本放置在容器内的 \u002Fopt\u002Fml\u002Fcode 目录，并通过 SAGEMAKER_PROGRAM 环境变量指定入口脚本。支持通过超参数和环境变量（如 SM_CHANNEL_*）传递配置。若使用 SageMaker 预构建镜像，可能已包含此库。在本地测试时需安装 Docker，在云端使用时需将镜像推送至 ECR。","根据 PyPI 徽章支持多种版本，具体需参考 sagemaker-training 包兼容性",[108,109],"sagemaker-training","docker",[14],[73,112,113,109,114,115,116],"sagemaker","training","python","machine-learning","deep-learning","2026-03-27T02:49:30.150509","2026-04-12T20:16:07.788900",[120,125,130,135,139,143],{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},30939,"在 SageMaker 推理服务期间遇到 'modules.import_module' 错误且报错信息不明确，如何调试？","该错误通常发生在自定义容器或模块安装失败时。虽然原始报错堆栈可能未直接显示具体的安装命令失败原因，但可以通过检查 CloudWatch 日志中 `sagemaker-containers` 输出的详细错误信息来定位。如果是自定义 Docker 镜像（如扩展了 PyTorch 官方镜像），请确保：\n1. 自定义分支的代码能正确被 `pip install` 或 `setup.py` 安装。\n2. 入口脚本路径和模块名称配置正确。\n3. 
依赖项在构建镜像时已完全安装。\n如果问题依旧，建议在容器启动脚本中增加显式的日志打印，或在本地复现 `docker run` 环境进行调试，因为默认的 `raise error_class` 有时不会完整输出 stderr 中的所有上下文。","https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-training-toolkit\u002Fissues\u002F7",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},30940,"使用 README 中的示例部署模型时，为什么上传到 S3 的模型文件是空的或者结构不一致？","这是一个已知的问题，通常与本地模式（local mode）和云端训练模式的输出路径处理差异有关。\n- **现象**：本地训练时，模型文件可能直接位于 `job_name\u002Fmodel.tar.gz`；而在云端实例（如 ml.c5.xlarge）上运行时，输出可能会被包裹在额外的 `output\u002F` 目录中（即 `job_name\u002Foutput\u002Fmodel.tar.gz`），或者由于权限\u002F路径配置问题导致上传空文件。\n- **解决方案**：\n1. 检查训练脚本中保存模型的路径是否严格遵循 SageMaker 约定的 `\u002Fopt\u002Fml\u002Fmodel` 目录。\n2. 确保在打包模型时（通常是 `tar czf model.tar.gz` 步骤），当前工作目录正确，且没有将父目录错误地打包进去。\n3. 参考官方更新的示例（见关联 Issue #57）以获取最新的最佳实践，避免使用过时的 Docker 构建或推送脚本。","https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-training-toolkit\u002Fissues\u002F13",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},30941,"在 SageMaker 本地模式（Local Mode）下运行自带容器（BYOC）时，为什么无法获取默认的环境变量（如 SM_MODEL_DIR, SM_CHANNEL_TRAIN）？","这是本地模式的一个已知限制或 Bug。在本地运行自定义容器时，SageMaker 训练工具包可能不会像在生产环境中那样自动注入所有的 `SM_*` 环境变量（例如 `SM_MODEL_DIR`, `SM_HPS`, `SM_CHANNEL_TRAIN` 等）。\n- **临时解决方案**：\n1. **手动设置环境变量**：在本地运行脚本或 Docker 启动命令中，显式地导出这些变量。例如：\n   ```bash\n   export SM_MODEL_DIR=\"\u002Fopt\u002Fml\u002Fmodel\"\n   export SM_CHANNEL_TRAIN=\"\u002Fopt\u002Fml\u002Finput\u002Fdata\u002Ftrain\"\n   # 其他变量同理\n   python app.py\n   ```\n2. **代码容错处理**：在 Python 脚本中使用 `os.environ.get('VAR_NAME', 'default_value')` 而不是直接访问，以便在本地调试时提供默认值。\n3. **避免过度依赖 SM_TRAINING_ENV**：对于大型 JSON 配置，不要完全依赖 `SM_TRAINING_ENV` 环境变量，可以考虑通过超参数（hyperparameters）传递关键配置，尽管这对复杂结构不太方便。","https:\u002F\u002Fgithub.com\u002Faws\u002Fsagemaker-training-toolkit\u002Fissues\u002F107",{"id":136,"question_zh":137,"answer_zh":138,"source_url":134},30942,"如何在自定义 SageMaker 容器中正确配置入口点（ENTRYPOINT）以兼容训练工具包？","在构建自带容器（BYOC）时，Dockerfile 的配置至关重要。根据社区经验，推荐的配置方式如下：\n1. 
**基础镜像选择**：可以使用 `python:3.7` 进行构建阶段，然后使用轻量级镜像（如 `gcr.io\u002Fdistroless\u002Fpython3-debian10`）作为运行阶段，但需确保将所有依赖包（site-packages）正确复制过去。\n2. **依赖安装**：在构建阶段运行 `pip install -r requirements.txt`，并将生成的包复制到最终镜像。\n3. **工作目录**：务必将代码复制到 `\u002Fopt\u002Fml\u002Fcode` 并设置 `WORKDIR \u002Fopt\u002Fml\u002Fcode`。\n4. **入口点**：虽然可以直接指定 `ENTRYPOINT [\"python\", \"app.py\"]`，但为了兼容 SageMaker 的参数解析机制，建议确保 `app.py` 能够正确处理命令行参数（使用 `argparse` 解析 `--model_dir` 等），或者直接调用 `sagemaker_training_toolkit` 的入口函数来自动处理环境注入和模块加载。如果手动解析参数，需注意本地模式下环境变量可能缺失的问题。",{"id":140,"question_zh":141,"answer_zh":142,"source_url":124},30943,"在 SageMaker 批量转换（Batch Transform）任务中遇到模块导入错误，可能的原因是什么？","批量转换任务中出现类似 `importlib.import_module` 或 `modules.import_module` 的错误，通常是因为推理代码依赖的模块在容器环境中不可用或路径配置错误。\n- **常见原因**：\n1. **依赖缺失**：自定义的 `requirements.txt` 未在容器构建过程中正确安装，或者安装的版本与代码不兼容。\n2. **模块路径问题**：代码试图导入的模块不在 `PYTHONPATH` 中。SageMaker 默认会将 `\u002Fopt\u002Fml\u002Fcode` 加入路径，但如果代码结构复杂（如有嵌套包），可能需要调整 `sys.path` 或使用相对导入。\n3. **初始化错误**：某些库（如 fast.ai 或特定版本的 PyTorch\u002FTensorFlow）在初始化时需要特定的环境变量或硬件检测（如 GPU），如果在无 GPU 实例上运行且代码未做兼容处理，也可能导致导入失败。\n- **建议**：检查容器构建日志确认依赖安装成功，并在本地模拟 SageMaker 的目录结构进行测试。",{"id":144,"question_zh":145,"answer_zh":146,"source_url":129},30944,"本地训练与云端训练产生的模型输出路径结构不一致，如何处理？","用户反馈指出，使用 `train_instance_type='local'` 和云端实例（如 `ml.c5.xlarge`）时，S3 输出路径结构存在差异：\n- **本地模式**：通常生成 `s3:\u002F\u002Fbucket\u002Fjob_name\u002Fmodel.tar.gz`。\n- **云端模式**：可能生成 `s3:\u002F\u002Fbucket\u002Fjob_name\u002Foutput\u002Fmodel.tar.gz`。\n这种不一致性会导致后续部署或处理脚本失效。\n- **应对策略**：\n1. **统一读取逻辑**：在下游任务中编写灵活的路径解析逻辑，同时兼容这两种结构（先检查 `output\u002F` 子目录是否存在）。\n2. **显式指定输出行为**：虽然主要取决于 SageMaker 后端实现，但确保训练脚本中将模型保存到 `\u002Fopt\u002Fml\u002Fmodel` 是标准做法，不要自行创建额外的输出目录结构。\n3. 
**关注官方修复**：此类行为差异被视为潜在的非预期行为，建议关注相关 Issue 的更新，看是否有补丁统一这种行为。",[148,153,158,163,168,173,178,183,188,193,198,203,208,213,218,223,228,233,238,243],{"id":149,"version":150,"summary_zh":151,"released_at":152},222792,"v5.1.1","### 错误修复及其他更改\n\n * 更新 files.py","2025-09-22T16:56:56",{"id":154,"version":155,"summary_zh":156,"released_at":157},222793,"v5.1.0","### 功能\n\n * 添加对 Ultraserver 作业的支持\n\n### 错误修复及其他更改\n\n * 格式化\n * 编译参数在 macOS 上无法正常工作","2025-08-08T19:47:07",{"id":159,"version":160,"summary_zh":161,"released_at":162},222794,"v5.0.0","### 重大变更\n\n * 将 Protocol Buffers 版本更新至 5.28.1","2025-06-04T16:56:43",{"id":164,"version":165,"summary_zh":166,"released_at":167},222795,"v4.9.0","### 功能\n\n * 添加代码所有者文件","2025-02-11T16:56:58",{"id":169,"version":170,"summary_zh":171,"released_at":172},222796,"v4.8.4","### 错误修复及其他更改\n\n * 在创建 \u002Fopt\u002Fml\u002Fcode 时，考虑可能的竞态条件","2025-02-03T16:57:28",{"id":174,"version":175,"summary_zh":176,"released_at":177},222797,"v4.8.3","### 错误修复及其他更改\n\n * 修复失败的单元测试\n * 避免将标准错误输出解析为 JSON","2024-12-09T23:09:07",{"id":179,"version":180,"summary_zh":181,"released_at":182},222798,"v4.8.2","### 错误修复及其他更改\n\n * 为 trn2 暂时硬编码神经元核心","2024-12-06T23:14:41",{"id":184,"version":185,"summary_zh":186,"released_at":187},222799,"v4.8.1","### 错误修复及其他更改\n\n * 添加了 p5 作为支持的 NCCL 实例","2024-09-09T16:55:46",{"id":189,"version":190,"summary_zh":191,"released_at":192},222800,"v4.8.0","### 功能\n\n * 添加对 Python 3.9 和 3.10 的支持\n\n### 错误修复及其他更改\n\n * 运行单元测试命令中的拼写错误\n * 为发布流程按顺序运行单元测试，以防止覆盖率冲突问题\n * 杂项：移除不必要的日志信息","2024-08-14T19:02:36",{"id":194,"version":195,"summary_zh":196,"released_at":197},222801,"v4.7.4","### 错误修复及其他更改\n\n * 更新 boto 依赖，使用最新版本的 boto","2023-10-31T18:03:17",{"id":199,"version":200,"summary_zh":201,"released_at":202},222802,"v4.7.3","### Bug Fixes and Other Changes\n\n * bypass DNS check for studio local exec","2023-10-23T16:46:13",{"id":204,"version":205,"summary_zh":206,"released_at":207},222803,"v4.7.2","### Bug Fixes and Other Changes\n\n * use smddprun 
only if it is installed","2023-10-19T16:46:00",{"id":209,"version":210,"summary_zh":211,"released_at":212},222804,"v4.7.1","### Bug Fixes and Other Changes\n\n * Add NCCL_PROTO=simple environment variable to handle the out-of-order data delivery from EFA\n * toolkit build failure","2023-10-17T16:46:51",{"id":214,"version":215,"summary_zh":216,"released_at":217},222805,"v4.7.0","### Features\n\n * support codeartifact for installing requirements.txt packages","2023-08-08T16:46:43",{"id":219,"version":220,"summary_zh":221,"released_at":222},222806,"v4.6.1","### Bug Fixes and Other Changes\n\n * removed unused import statement\n * forgot to run black on torch_distributed.py after updating my comments from last commit\n * Modified my comment on line 98-103 in torch_distributed.py to comply with formatting standard.\n * Revert \"Ran black on entire sagemaker-training-toolkit directory\"\n * Ran black on entire sagemaker-training-toolkit directory\n * Ran Black (python formatter) on the files with my code updates (torch_distributed.py and test_torch_distributed.py)\n * Added test for neuron_parallel_compile in test_torch_distributed.py\n * Updated comment syntax based on feedback in pull request as well as added full example of the neuron_parallel_compile command as it would appear in the command line\n * added unit test for neuron_parallel_compile code change\n * Updated torch_distributed.py","2023-06-19T16:46:08",{"id":224,"version":225,"summary_zh":226,"released_at":227},222807,"v4.6.0","### Features\n\n * add smddp exception classes in mpi distribution","2023-06-15T16:46:32",{"id":229,"version":230,"summary_zh":231,"released_at":232},222808,"v4.5.0","### Features\n\n * add NCCL_PROTO, NCCL_ALGO environments for modelparallel jobs","2023-04-26T16:47:04",{"id":234,"version":235,"summary_zh":236,"released_at":237},222809,"v4.4.10","### Bug Fixes and Other Changes\n\n * unpin sagemaker version as the credential issue 
fixed","2023-04-10T16:46:48",{"id":239,"version":240,"summary_zh":241,"released_at":242},222810,"v4.4.9","### Bug Fixes and Other Changes\n\n * increase worker waiting time for ORTE proc","2023-04-05T16:46:25",{"id":244,"version":245,"summary_zh":246,"released_at":247},222811,"v4.4.8","### Bug Fixes and Other Changes\n\n * upgrade protobuf version for tensorflow 2.12","2023-03-09T21:06:23"]