[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-NVIDIA--k8s-device-plugin":3,"tool-NVIDIA--k8s-device-plugin":62},[4,18,26,36,46,54],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159636,2,"2026-04-17T23:33:34",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":42,"last_commit_at":43,"category_tags":44,"status":17},8272,"opencode","anomalyco\u002Fopencode","OpenCode 是一款开源的 AI 编程助手（Coding Agent），旨在像一位智能搭档一样融入您的开发流程。它不仅仅是一个代码补全插件，而是一个能够理解项目上下文、自主规划任务并执行复杂编码操作的智能体。无论是生成全新功能、重构现有代码，还是排查难以定位的 Bug，OpenCode 都能通过自然语言交互高效完成，显著减少开发者在重复性劳动和上下文切换上的时间消耗。\n\n这款工具专为软件开发者、工程师及技术研究人员设计，特别适合希望利用大模型能力来提升编码效率、加速原型开发或处理遗留代码维护的专业人群。其核心亮点在于完全开源的架构，这意味着用户可以审查代码逻辑、自定义行为策略，甚至私有化部署以保障数据安全，彻底打破了传统闭源 AI 助手的“黑盒”限制。\n\n在技术体验上，OpenCode 提供了灵活的终端界面（Terminal UI）和正在测试中的桌面应用程序，支持 macOS、Windows 及 Linux 全平台。它兼容多种包管理工具，安装便捷，并能无缝集成到现有的开发环境中。无论您是追求极致控制权的资深极客，还是渴望提升产出的独立开发者，OpenCode 都提供了一个透明、可信",144296,1,"2026-04-16T14:50:03",[13,45],"插件",{"id":47,"name":48,"github_repo":49,"description_zh":50,"stars":51,"difficulty_score":32,"last_commit_at":52,"category_tags":53,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 
Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":55,"name":56,"github_repo":57,"description_zh":58,"stars":59,"difficulty_score":32,"last_commit_at":60,"category_tags":61,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[45,13,15,14],{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":68,"readme_en":69,"readme_zh":70,"quickstart_zh":71,"use_case_zh":72,"hero_image_url":73,"owner_login":74,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":101,"forks":102,"last_commit_at":103,"license":104,"difficulty_score":105,"env_os":106,"env_gpu":107,"env_ram":108,"env_deps":109,"category_tags":117,"github_topics":118,"view_count":32,"oss_zip_url":78,"oss_zip_packed_at":78,"status":17,"created_at":120,"updated_at":121,"faqs":122,"releases":152},8676,"NVIDIA\u002Fk8s-device-plugin","k8s-device-plugin","NVIDIA device plugin for Kubernetes","k8s-device-plugin 是 NVIDIA 官方为 Kubernetes 集群打造的 GPU 设备插件，旨在让容器化应用能够轻松调用显卡算力。它以 DaemonSet 形式运行在集群节点上，自动发现并暴露节点上的 GPU 数量，实时监控硬件健康状态，从而允许用户在 Pod 中直接申请和使用 GPU 资源。\n\n在 Kubernetes 原生环境中，调度器默认无法识别 GPU 等特殊硬件，导致人工智能训练、深度学习推理或高性能计算任务难以高效部署。k8s-device-plugin 正是为解决这一痛点而生，它填补了底层驱动与上层编排之间的空白，实现了 GPU 资源的自动化管理与隔离，确保多个任务能安全、稳定地共享显卡资源。\n\n这款工具主要面向运维工程师、AI 开发者及科研人员，特别是那些需要在大规模集群中管理深度学习工作负载的团队。其技术亮点包括支持通过 Helm 图表便捷部署、提供细粒度的配置选项（如基于时间片的 GPU 共享和 CUDA MPS 多进程服务），以及集成 GPU 特性发现功能以自动添加节点标签。作为云原生 AI 基础设施的关键组件，k8s-device-plug","k8s-device-plugin 是 NVIDIA 官方为 Kubernetes 集群打造的 GPU 设备插件，旨在让容器化应用能够轻松调用显卡算力。它以 DaemonSet 形式运行在集群节点上，自动发现并暴露节点上的 GPU 数量，实时监控硬件健康状态，从而允许用户在 Pod 中直接申请和使用 GPU 资源。\n\n在 Kubernetes 原生环境中，调度器默认无法识别 GPU 等特殊硬件，导致人工智能训练、深度学习推理或高性能计算任务难以高效部署。k8s-device-plugin 正是为解决这一痛点而生，它填补了底层驱动与上层编排之间的空白，实现了 GPU 资源的自动化管理与隔离，确保多个任务能安全、稳定地共享显卡资源。\n\n这款工具主要面向运维工程师、AI 开发者及科研人员，特别是那些需要在大规模集群中管理深度学习工作负载的团队。其技术亮点包括支持通过 Helm 图表便捷部署、提供细粒度的配置选项（如基于时间片的 GPU 共享和 CUDA MPS 多进程服务），以及集成 GPU 特性发现功能以自动添加节点标签。作为云原生 AI 基础设施的关键组件，k8s-device-plugin 帮助用户将复杂的硬件管理转化为简单的资源调度请求，极大提升了集群的利用率和开发效率。","# NVIDIA device plugin for Kubernetes\n\n[![End-to-end Tests](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Factions\u002Fworkflows\u002Fe2e.yaml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Factions\u002Fworkflows\u002Fe2e.yaml) [![Go Report Card](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA_k8s-device-plugin_readme_2b4a70945b89.png)](https:\u002F\u002Fgoreportcard.com\u002Freport\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin) [![Latest Release](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002FNVIDIA\u002Fk8s-device-plugin)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Freleases\u002Flatest)\n\n## 
Table of Contents\n\n- [About](#about)\n- [Prerequisites](#prerequisites)\n- [Quick Start](#quick-start)\n  - [Preparing your GPU Nodes](#preparing-your-gpu-nodes)\n    - [Example for debian-based systems with `docker` and `containerd`](#example-for-debian-based-systems-with-docker-and-containerd)\n      - [Install the NVIDIA Container Toolkit](#install-the-nvidia-container-toolkit)\n      - [Notes on `CRI-O` configuration](#notes-on-cri-o-configuration)\n  - [Enabling GPU Support in Kubernetes](#enabling-gpu-support-in-kubernetes)\n  - [Running GPU Jobs](#running-gpu-jobs)\n- [Configuring the NVIDIA device plugin binary](#configuring-the-nvidia-device-plugin-binary)\n  - [As command line flags or envvars](#as-command-line-flags-or-envvars)\n  - [As a configuration file](#as-a-configuration-file)\n  - [Configuration Option Details](#configuration-option-details)\n  - [Shared Access to GPUs](#shared-access-to-gpus)\n    - [With CUDA Time-Slicing](#with-cuda-time-slicing)\n    - [With CUDA MPS](#with-cuda-mps)\n  - [IMEX Support](#imex-support)\n- [Catalog of Labels](#catalog-of-labels)\n- [Deployment via `helm`](#deployment-via-helm)\n  - [Configuring the device plugin's `helm` chart](#configuring-the-device-plugins-helm-chart)\n    - [Passing configuration to the plugin via a `ConfigMap`](#passing-configuration-to-the-plugin-via-a-configmap)\n      - [Single Config File Example](#single-config-file-example)\n      - [Multiple Config File Example](#multiple-config-file-example)\n      - [Updating Per-Node Configuration With a Node Label](#updating-per-node-configuration-with-a-node-label)\n    - [Setting other helm chart values](#setting-other-helm-chart-values)\n    - [Deploying with gpu-feature-discovery for automatic node labels](#deploying-with-gpu-feature-discovery-for-automatic-node-labels)\n    - [Deploying gpu-feature-discovery in standalone mode](#deploying-gpu-feature-discovery-in-standalone-mode)\n  - [Deploying via `helm install` with a direct URL to the `helm` package](#deploying-via-helm-install-with-a-direct-url-to-the-helm-package)\n- [Building and Running Locally](#building-and-running-locally)\n  - [With Docker](#with-docker)\n    - [Build](#build)\n    - [Run](#run)\n  - [Without Docker](#without-docker)\n    - [Build](#build-1)\n    - [Run](#run-1)\n- [Changelog](#changelog)\n- [Issues and Contributing](#issues-and-contributing)\n  - [Versioning](#versioning)\n  - [Upgrading Kubernetes with the Device Plugin](#upgrading-kubernetes-with-the-device-plugin)\n\n## About\n\nThe NVIDIA device plugin for Kubernetes is a Daemonset that allows you to automatically:\n\n- Expose the number of GPUs on each nodes of your cluster\n- Keep track of the health of your GPUs\n- Run GPU enabled containers in your Kubernetes cluster.\n\nThis repository contains NVIDIA's official implementation of the [Kubernetes device plugin](https:\u002F\u002Fkubernetes.io\u002Fdocs\u002Fconcepts\u002Fextend-kubernetes\u002Fcompute-storage-net\u002Fdevice-plugins\u002F).\nAs of v0.15.0 this repository also holds the implementation for GPU Feature Discovery labels,\nfor further information on GPU Feature Discovery see [here](docs\u002Fgpu-feature-discovery\u002FREADME.md).\n\nPlease note that:\n\n- The NVIDIA device plugin API is beta as of Kubernetes v1.10.\n- The NVIDIA device plugin is currently lacking\n  - Comprehensive GPU health checking features\n  - GPU cleanup features\n- Support will only be provided for the official NVIDIA device plugin (and not\n  for forks or other variants of this 
plugin).\n\n## Prerequisites\n\nThe list of prerequisites for running the NVIDIA device plugin is described below:\n\n- NVIDIA drivers ~= 384.81\n- nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)\n- nvidia-container-runtime configured as the default low-level runtime\n- Kubernetes version >= 1.10\n\n## Quick Start\n\n### Preparing your GPU Nodes\n\nThe following steps need to be executed on all your GPU nodes.\nThis README assumes that the NVIDIA drivers and the `nvidia-container-toolkit` have been pre-installed.\nIt also assumes that you have configured the `nvidia-container-runtime` as the default low-level runtime to use.\n\nPlease see: https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Finstall-guide.html\n\n#### Example for debian-based systems with `docker` and `containerd`\n\n##### Install the NVIDIA Container Toolkit\n\nFor instructions on installing and getting started with the NVIDIA Container Toolkit, refer to the [installation guide](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Finstall-guide.html#installation-guide).\n\nAlso note the configuration instructions for:\n\n- [`containerd`](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html#configuring-containerd-for-kubernetes)\n- [`CRI-O`](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html#configuring-cri-o)\n- [`docker` (Deprecated)](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html#configuring-docker)\n\nRemembering to restart each runtime after applying the configuration changes.\n\nIf the `nvidia` runtime should be set as the default runtime (with non-cri docker versions, for example), the `--set-as-default` argument\nmust also be included in the commands above. If this is not done, a RuntimeClass needs to be defined:\n\n```yaml\napiVersion: node.k8s.io\u002Fv1\nkind: RuntimeClass\nmetadata:\n  name: nvidia\nhandler: nvidia\n```\n\n##### Notes on `CRI-O` configuration\n\nWhen running `kubernetes` with `CRI-O`, add the config file to set the\n`nvidia-container-runtime` as the default low-level OCI runtime under\n`\u002Fetc\u002Fcrio\u002Fcrio.conf.d\u002F99-nvidia.conf`. 
This will take priority over the default\n`crun` config file at `\u002Fetc\u002Fcrio\u002Fcrio.conf.d\u002F10-crun.conf`:\n\n```toml\n[crio]\n\n  [crio.runtime]\n    default_runtime = \"nvidia\"\n\n    [crio.runtime.runtimes]\n\n      [crio.runtime.runtimes.nvidia]\n        runtime_path = \"\u002Fusr\u002Fbin\u002Fnvidia-container-runtime\"\n        runtime_type = \"oci\"\n```\n\nAs stated in the linked documentation, this file can automatically be generated with the nvidia-ctk command:\n\n```shell\nsudo nvidia-ctk runtime configure --runtime=crio --set-as-default --config=\u002Fetc\u002Fcrio\u002Fcrio.conf.d\u002F99-nvidia.conf\n```\n\n`CRI-O` uses `crun` as default low-level OCI runtime so `crun` needs to be added\nto the runtimes of the `nvidia-container-runtime` in the config file at `\u002Fetc\u002Fnvidia-container-runtime\u002Fconfig.toml`:\n\n```toml\n[nvidia-container-runtime]\nruntimes = [\"crun\", \"docker-runc\", \"runc\"]\n```\n\nAnd then restart `CRI-O`:\n\n```shell\nsudo systemctl restart crio\n```\n\n### Enabling GPU Support in Kubernetes\n\nOnce you have configured the options above on all the GPU nodes in your\ncluster, you can enable GPU support by deploying the following Daemonset:\n\n```shell\nkubectl create -f https:\u002F\u002Fraw.githubusercontent.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fv0.17.1\u002Fdeployments\u002Fstatic\u002Fnvidia-device-plugin.yml\n```\n\n**Note:** This is a simple static daemonset meant to demonstrate the basic\nfeatures of the `nvidia-device-plugin`. Please see the instructions below for\n[Deployment via `helm`](#deployment-via-helm) when deploying the plugin in a\nproduction setting.\n\n### Running GPU Jobs\n\nWith the daemonset deployed, NVIDIA GPUs can now be requested by a container\nusing the `nvidia.com\u002Fgpu` resource type:\n\n```shell\ncat \u003C\u003CEOF | kubectl apply -f -\napiVersion: v1\nkind: Pod\nmetadata:\n  name: gpu-pod\nspec:\n  restartPolicy: Never\n  containers:\n    - name: cuda-container\n      image: nvcr.io\u002Fnvidia\u002Fk8s\u002Fcuda-sample:vectoradd-cuda12.5.0\n      resources:\n        limits:\n          nvidia.com\u002Fgpu: 1 # requesting 1 GPU\n  tolerations:\n  - key: nvidia.com\u002Fgpu\n    operator: Exists\n    effect: NoSchedule\nEOF\n```\n\n```shell\n$ kubectl logs gpu-pod\n[Vector addition of 50000 elements]\nCopy input data from the host memory to the CUDA device\nCUDA kernel launch with 196 blocks of 256 threads\nCopy output data from the CUDA device to the host memory\nTest PASSED\nDone\n```\n\n> [!WARNING]\n> If you do not request GPUs when you use the device plugin, the plugin exposes all the GPUs on the machine inside your container.\n\n## Configuring the NVIDIA device plugin binary\n\nThe NVIDIA device plugin has a number of options that can be configured for it.\nThese options can be configured as command line flags, environment variables,\nor via a config file when launching the device plugin. Here we explain what\neach of these options are and how to configure them directly against the plugin\nbinary. 
The following section explains how to set these configurations when\ndeploying the plugin via `helm`.\n\n### As command line flags or envvars\n\n| Flag                     | Environment Variable    | Default Value   |\n|--------------------------|-------------------------|-----------------|\n| `--mig-strategy`         | `$MIG_STRATEGY`         | `\"none\"`        |\n| `--fail-on-init-error`   | `$FAIL_ON_INIT_ERROR`   | `true`          |\n| `--nvidia-driver-root`   | `$NVIDIA_DRIVER_ROOT`   | `\"\u002F\"`           |\n| `--pass-device-specs`    | `$PASS_DEVICE_SPECS`    | `false`         |\n| `--device-list-strategy` | `$DEVICE_LIST_STRATEGY` | `\"envvar\"`      |\n| `--device-id-strategy`   | `$DEVICE_ID_STRATEGY`   | `\"uuid\"`        |\n| `--config-file`          | `$CONFIG_FILE`          | `\"\"`            |\n\n### As a configuration file\n\n```yaml\nversion: v1\nflags:\n  migStrategy: \"none\"\n  failOnInitError: true\n  nvidiaDriverRoot: \"\u002F\"\n  plugin:\n    passDeviceSpecs: false\n    deviceListStrategy: \"envvar\"\n    deviceIDStrategy: \"uuid\"\n```\n\n**Note:** The configuration file has an explicit `plugin` section because it\nis a shared configuration between the plugin and\n[`gpu-feature-discovery`](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fgpu-feature-discovery).\nAll options inside the `plugin` section are specific to the plugin. All\noptions outside of this section are shared.\n\n### Configuration Option Details\n\n**`MIG_STRATEGY`**:\n  the desired strategy for exposing MIG devices on GPUs that support it\n\n  `[none | single | mixed] (default 'none')`\n\n  The `MIG_STRATEGY` option configures the daemonset to be able to expose\n  Multi-Instance GPUs (MIG) on GPUs that support them. More information on what\n  these strategies are and how they should be used can be found in [Supporting\n  Multi-Instance GPUs (MIG) in\n  Kubernetes](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1mdgMQ8g7WmaI_XVVRrCvHPFPOMCm5LQD5JefgAh6N8g).\n\n  **Note:** With a `MIG_STRATEGY` of mixed, you will have additional resources\n  available to you of the form `nvidia.com\u002Fmig-\u003Cslice_count>g.\u003Cmemory_size>gb`\n  that you can set in your pod spec to get access to a specific MIG device.\n\n**`FAIL_ON_INIT_ERROR`**:\n  fail the plugin if an error is encountered during initialization, otherwise block indefinitely\n\n  `(default 'true')`\n\n  When set to true, the `FAIL_ON_INIT_ERROR` option fails the plugin if an error is\n  encountered during initialization. When set to false, it prints an error\n  message and blocks the plugin indefinitely instead of failing. Blocking\n  indefinitely follows legacy semantics that allow the plugin to deploy\n  successfully on nodes that don't have GPUs on them (and aren't supposed to have\n  GPUs on them) without throwing an error. In this way, you can blindly deploy a\n  daemonset with the plugin on all nodes in your cluster, whether they have GPUs\n  on them or not, without encountering an error.  However, doing so means that\n  there is no way to detect an actual error on nodes that are supposed to have\n  GPUs on them. Failing if an initialization error is encountered is now the\n  default and should be adopted by all new deployments.\n\n**`NVIDIA_DRIVER_ROOT`**:\n  the root path for the NVIDIA driver installation\n\n  `(default '\u002F')`\n\n  When the NVIDIA drivers are installed directly on the host, this should be\n  set to `'\u002F'`. When installed elsewhere (e.g. 
via a driver container), this\n  should be set to the root filesystem where the drivers are installed (e.g.\n  `'\u002Frun\u002Fnvidia\u002Fdriver'`).\n\n  **Note:** This option is only necessary when used in conjunction with the\n  `$PASS_DEVICE_SPECS` option described below. It tells the plugin what prefix\n  to add to any device file paths passed back as part of the device specs.\n\n**`PASS_DEVICE_SPECS`**:\n  pass the paths and desired device node permissions for any NVIDIA devices\n  being allocated to the container\n\n  `(default 'false')`\n\n  This option exists for the sole purpose of allowing the device plugin to\n  interoperate with the `CPUManager` in Kubernetes. Setting this flag also\n  requires one to deploy the daemonset with elevated privileges, so only do so if\n  you know you need to interoperate with the `CPUManager`.\n\n**`DEVICE_LIST_STRATEGY`**:\n  the desired strategy for passing the device list to the underlying runtime\n\n  `[envvar | volume-mounts | cdi-annotations | cdi-cri ] (default 'envvar')`\n\n  **Note**: Multiple device list strategies can be specified (as a comma-separated list).\n\n  The `DEVICE_LIST_STRATEGY` flag allows one to choose which strategy the plugin\n  will use to advertise the list of GPUs allocated to a container. Possible values are:\n\n  - `envvar` (default): the `NVIDIA_VISIBLE_DEVICES` environment variable\n  as described\n  [here](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fnvidia-container-runtime#nvidia_visible_devices)\n  is used to select the devices that are to be injected by the NVIDIA Container Runtime.\n  - `volume-mounts`: the list of devices is passed as a set of volume mounts instead of as an environment variable\n  to instruct the NVIDIA Container Runtime to inject the devices.\n  Details for the\n  rationale behind this strategy can be found\n  [here](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw\u002Fedit#heading=h.b3ti65rojfy5).\n  - `cdi-annotations`: CDI annotations are used to select the devices that are to be injected.\n  Note that this does not require the NVIDIA Container Runtime, but does required a CDI-enabled container engine.\n  - `cdi-cri`: the `CDIDevices` CRI field is used to select the CDI devices that are to be injected.\n  This requires support in Kubernetes to forward these requests in the CRI to a CDI-enabled container engine.\n\n**`DEVICE_ID_STRATEGY`**:\n  the desired strategy for passing device IDs to the underlying runtime\n\n  `[uuid | index] (default 'uuid')`\n\n  The `DEVICE_ID_STRATEGY` flag allows one to choose which strategy the plugin will\n  use to pass the device ID of the GPUs allocated to a container. The device ID\n  has traditionally been passed as the UUID of the GPU. This flag lets a user\n  decide if they would like to use the UUID or the index of the GPU (as seen in\n  the output of `nvidia-smi`) as the identifier passed to the underlying runtime.\n  Passing the index may be desirable in situations where pods that have been\n  allocated GPUs by the plugin get restarted with different physical GPUs\n  attached to them.\n\n**`CONFIG_FILE`**:\n  point the plugin at a configuration file instead of relying on command line\n  flags or environment variables\n\n  `(default '')`\n\n  The order of precedence for setting each option is (1) command line flag, (2)\n  environment variable, (3) configuration file. In this way, one could use a\n  pre-defined configuration file, but then override the values set in it at\n  launch time. 
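For illustration, here is a minimal sketch of that precedence, assuming the plugin binary is invoked directly as shown in [Building and Running Locally](#building-and-running-locally) and reusing the example config file from the [Single Config File Example](#single-config-file-example) below:\n\n  ```shell\n  # \u002Ftmp\u002Fdp-example-config0.yaml (see below) sets migStrategy to \"none\".\n  # The envvar asks for \"single\", but the command line flag asks for \"mixed\".\n  # Flags take precedence over envvars, which take precedence over the config\n  # file, so in this sketch the plugin would come up with a MIG strategy of \"mixed\".\n  MIG_STRATEGY=single .\u002Fk8s-device-plugin \\\n    --config-file=\u002Ftmp\u002Fdp-example-config0.yaml \\\n    --mig-strategy=mixed\n  ```\n\n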
As described below, a `ConfigMap` can be used to point the\n  plugin at a desired configuration file when deploying via `helm`.\n\n### Shared Access to GPUs\n\nThe NVIDIA device plugin allows oversubscription of GPUs through a set of\nextended options in its configuration file. There are two flavors of sharing\navailable: Time-Slicing and MPS.\n\n> [!NOTE]\n> Time-slicing and MPS are mutually exclusive.\n\nIn the case of time-slicing, CUDA time-slicing is used to allow workloads sharing a GPU to\ninterleave with each other. However, nothing special is done to isolate workloads that are\ngranted replicas from the same underlying GPU, and each workload has access to\nthe GPU memory and runs in the same fault-domain as of all the others (meaning\nif one workload crashes, they all do).\n\nIn the case of MPS, a control daemon is used to manage access to the shared GPU.\nIn contrast to time-slicing, MPS does space partitioning and allows memory and\ncompute resources to be explicitly partitioned and enforces these limits per\nworkload.\n\nWith both time-slicing and MPS, the same sharing method is applied to all GPUs on\na node. You cannot configure sharing on a per-GPU basis.\n\n#### With CUDA Time-Slicing\n\nThe extended options for sharing using time-slicing can be seen below:\n\n```yaml\nversion: v1\nsharing:\n  timeSlicing:\n    renameByDefault: \u003Cbool>\n    failRequestsGreaterThanOne: \u003Cbool>\n    resources:\n    - name: \u003Cresource-name>\n      replicas: \u003Cnum-replicas>\n    ...\n```\n\nThat is, for each named resource under `sharing.timeSlicing.resources`, a number\nof replicas can now be specified for that resource type. These replicas\nrepresent the number of shared accesses that will be granted for a GPU\nrepresented by that resource type.\n\nIf `renameByDefault=true`, then each resource will be advertised under the name\n`\u003Cresource-name>.shared` instead of simply `\u003Cresource-name>`.\n\nIf `failRequestsGreaterThanOne=true`, then the plugin will fail to allocate any\nshared resources to a container if they request more than one. 
The container’s\npod will fail with an `UnexpectedAdmissionError` and need to be manually deleted,\nupdated, and redeployed.\n\nFor example:\n\n```yaml\nversion: v1\nsharing:\n  timeSlicing:\n    resources:\n    - name: nvidia.com\u002Fgpu\n      replicas: 10\n```\n\nIf this configuration were applied to a node with 8 GPUs on it, the plugin\nwould now advertise 80 `nvidia.com\u002Fgpu` resources to Kubernetes instead of 8.\n\n```shell\n$ kubectl describe node\n...\nCapacity:\n  nvidia.com\u002Fgpu: 80\n...\n```\n\nLikewise, if the following configuration were applied to a node, then 80\n`nvidia.com\u002Fgpu.shared` resources would be advertised to Kubernetes instead of 8\n`nvidia.com\u002Fgpu` resources.\n\n```yaml\nversion: v1\nsharing:\n  timeSlicing:\n    renameByDefault: true\n    resources:\n    - name: nvidia.com\u002Fgpu\n      replicas: 10\n    ...\n```\n\n```shell\n$ kubectl describe node\n...\nCapacity:\n  nvidia.com\u002Fgpu.shared: 80\n...\n```\n\nIn both cases, the plugin simply creates 10 references to each GPU and\nindiscriminately hands them out to anyone that asks for them.\n\nIf `failRequestsGreaterThanOne=true` were set in either of these\nconfigurations and a user requested more than one `nvidia.com\u002Fgpu` or\n`nvidia.com\u002Fgpu.shared` resource in their pod spec, then the container would\nfail with the resulting error:\n\n```shell\n$ kubectl describe pod gpu-pod\n...\nEvents:\n  Type     Reason                    Age   From               Message\n  ----     ------                    ----  ----               -------\n  Warning  UnexpectedAdmissionError  13s   kubelet            Allocate failed due to rpc error: code = Unknown desc = request for 'nvidia.com\u002Fgpu: 2' too large: maximum request size for shared resources is 1, which is unexpected\n...\n```\n\n**Note:** Unlike with \"normal\" GPU requests, requesting more than one shared\nGPU does not imply that you will get guaranteed access to a proportional amount\nof compute power. It only implies that you will get access to a GPU that is\nshared by other clients (each of which has the freedom to run as many processes\non the underlying GPU as they want). Under the hood CUDA will simply give an\nequal share of time to all of the GPU processes across all of the clients. The\n`failRequestsGreaterThanOne` flag is meant to help users understand this\nsubtlety, by treating a request of `1` as an access request rather than an\nexclusive resource request. Setting `failRequestsGreaterThanOne=true` is\nrecommended, but it is set to `false` by default to retain backwards\ncompatibility.\n\nAs of now, the only supported resource available for time-slicing are\n`nvidia.com\u002Fgpu` as well as any of the resource types that emerge from\nconfiguring a node with the mixed MIG strategy.\n\nFor example, the full set of time-sliceable resources on a T4 card would be:\n\n```\nnvidia.com\u002Fgpu\n```\n\nAnd the full set of time-sliceable resources on an A100 40GB card would be:\n\n```\nnvidia.com\u002Fgpu\nnvidia.com\u002Fmig-1g.5gb\nnvidia.com\u002Fmig-2g.10gb\nnvidia.com\u002Fmig-3g.20gb\nnvidia.com\u002Fmig-7g.40gb\n```\n\nLikewise, on an A100 80GB card, they would be:\n\n```\nnvidia.com\u002Fgpu\nnvidia.com\u002Fmig-1g.10gb\nnvidia.com\u002Fmig-2g.20gb\nnvidia.com\u002Fmig-3g.40gb\nnvidia.com\u002Fmig-7g.80gb\n```\n\n#### With CUDA MPS\n\n> [!WARNING]\n> As of v0.15.0 of the device plugin, MPS support is considered experimental. 
Please see the [release notes](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Freleases\u002Ftag\u002Fv0.15.0) for further details.\n\n> [!NOTE]\n> Sharing with MPS is currently not supported on devices with MIG enabled.\n\nThe extended options for sharing using MPS can be seen below:\n\n```yaml\nversion: v1\nsharing:\n  mps:\n    renameByDefault: \u003Cbool>\n    resources:\n    - name: \u003Cresource-name>\n      replicas: \u003Cnum-replicas>\n    ...\n```\n\nThat is, for each named resource under `sharing.mps.resources`, a number\nof replicas can be specified for that resource type. As is the case with\ntime-slicing, these replicas represent the number of shared accesses that will\nbe granted for a GPU associated with that resource type. In contrast with\ntime-slicing, the amount of memory allowed per client (i.e. per partition) is\nmanaged by the MPS control daemon and limited to an equal fraction of the total\ndevice memory. In addition to controlling the amount of memory that each client\ncan consume, the MPS control daemon also limits the amount of compute capacity\nthat can be consumed by a client.\n\nIf `renameByDefault=true`, then each resource will be advertised under the name\n`\u003Cresource-name>.shared` instead of simply `\u003Cresource-name>`.\n\nFor example:\n\n```yaml\nversion: v1\nsharing:\n  mps:\n    resources:\n    - name: nvidia.com\u002Fgpu\n      replicas: 10\n```\n\nIf this configuration were applied to a node with 8 GPUs on it, the plugin\nwould now advertise 80 `nvidia.com\u002Fgpu` resources to Kubernetes instead of 8.\n\n```shell\n$ kubectl describe node\n...\nCapacity:\n  nvidia.com\u002Fgpu: 80\n...\n```\n\nLikewise, if the following configuration were applied to a node, then 80\n`nvidia.com\u002Fgpu.shared` resources would be advertised to Kubernetes instead of 8\n`nvidia.com\u002Fgpu` resources.\n\n```yaml\nversion: v1\nsharing:\n  mps:\n    renameByDefault: true\n    resources:\n    - name: nvidia.com\u002Fgpu\n      replicas: 10\n    ...\n```\n\n```shell\n$ kubectl describe node\n...\nCapacity:\n  nvidia.com\u002Fgpu.shared: 80\n...\n```\n\nFurthermore, each of these resources -- either `nvidia.com\u002Fgpu` or\n`nvidia.com\u002Fgpu.shared` -- would have access to the same fraction (1\u002F10) of the\ntotal memory and compute resources of the GPU.\n\n**Note**: As of now, the only supported resource available for MPS are `nvidia.com\u002Fgpu`\nresources and only with full GPUs.\n\n### IMEX Support\n\nThe NVIDIA GPU Device Plugin can be configured to inject IMEX channels into\nworkloads.\n\nThis opt-in behavior is global and affects all workloads and is controlled by\nthe `imex.channelIDs` and `imex.required` configuration options.\n\n| `imex.channelIDs` | `imex.required` | Effect |\n|---|---|---|\n| `[]` | * | (default) No IMEX channels are added to workload requests. Note that the `imex.required` field has no effect in this case |\n| `[0]` | `false` | If the requested IMEX channel (`0`) is discoverable by the NVIDIA GPU Device Plugin, the channel will be added to each workload request. If the channel cannot be discovered no channels are added to workload requests. |\n| `[0]` | `true` | If the requested IMEX channel (`0`) is discoverable by the NVIDIA GPU Device Plugin, the channel will be added to each workload request. If the channel cannot be discovered an error will be raised since the channel was marked as `required`. 
|\n\n**Note**: At present the only valid `imex.channelIDs` configurations are `[]` and `[0]`.\n\nFor the containerized NVIDIA GPU Device Plugin running to be able to successfully\ndiscover available IMEX channels, the corresponding device nodes must be available\nto the container.\n\n## Catalog of Labels\n\nThe NVIDIA device plugin reads and writes a number of different labels that it uses as either\nconfiguration elements or informational elements. The following table documents and describes each label\nalong with their use. See the related table [here](\u002Fdocs\u002Fgpu-feature-discovery\u002FREADME.md#generated-labels) for the labels GFD adds.\n\n| Label Name                          | Description                                                                                                                                                                                                                                  | Example        |\n| ----------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------- |\n| nvidia.com\u002Fdevice-plugin.config     | Specifies the configuration to apply to the node. You apply this this label to perform per-node configuration. Refer to [Updating Per-Node Configuration With a Node Label](#updating-per-node-configuration-with-a-node-label) for details. | my-mps-config  |\n| nvidia.com\u002Fgpu.sharing-strategy     | Specifies the sharing strategy. The default value, `none`, indicates no sharing.  Other values are `mps` and `time-slicing`.                                                                                                                 | time-slicing   |\n| nvidia.com\u002Fmig.capable              | Specifies if any device on the node supports MIG.                                                                                                                                                                                            | false          |\n| nvidia.com\u002Fmps.capable              | Specifies if devices on the node are configured for MPS.                                                                                                                                                                                     | false          |\n| nvidia.com\u002Fvgpu.present             | Specifies if devices on the node use vGPU.                                                                                                                                                                                                   | false          |\n| nvidia.com\u002Fvgpu.host-driver-branch  | Specifies the vGPU host driver branch on the underlying hypervisor.                                                                                                                                                                          | r550_40        |\n| nvidia.com\u002Fvgpu.host-driver-version | Specifies the vGPU host driver version on the underlying hypervisor.                                                                                                                                                                         
| 550.54.16      |\n\n## Deployment via `helm`\n\nThe preferred method to deploy the device plugin is as a daemonset using `helm`.\nInstructions for installing `helm` can be found\n[here](https:\u002F\u002Fhelm.sh\u002Fdocs\u002Fintro\u002Finstall\u002F).\n\nBegin by setting up the plugin's `helm` repository and updating it as follows:\n\n```shell\nhelm repo add nvdp https:\u002F\u002Fnvidia.github.io\u002Fk8s-device-plugin\nhelm repo update\n```\n\nThen verify that the latest release (`v0.17.1`) of the plugin is available:\n\n```shell\n$ helm search repo nvdp --devel\nNAME                     \t  CHART VERSION  APP VERSION\tDESCRIPTION\nnvdp\u002Fnvidia-device-plugin\t  0.17.1\t 0.17.1\t\tA Helm chart for ...\n```\n\nOnce this repo is updated, you can begin installing packages from it to deploy\nthe `nvidia-device-plugin` helm chart.\n\nThe most basic installation command without any options is then:\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --version 0.17.1\n```\n\n**Note:** You only need to pass the `--devel` flag to `helm search repo`\nand the `--version` flag to `helm upgrade -i` if this is a pre-release\nversion (e.g. `\u003Cversion>-rc.1`). Full releases will be listed without this.\n\n### Configuring the device plugin's `helm` chart\n\nThe `helm` chart for the latest release of the plugin (`v0.17.1`) includes\na number of customizable values.\n\nPrior to `v0.12.0` the most commonly used values were those that had direct\nmappings to the command line options of the plugin binary. As of `v0.12.0`, the\npreferred method to set these options is via a `ConfigMap`. The primary use\ncase of the original values is then to override an option from the `ConfigMap`\nif desired. Both methods are discussed in more detail below.\n\nThe full set of values that can be set can be found\n[here](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fblob\u002Fv0.17.1\u002Fdeployments\u002Fhelm\u002Fnvidia-device-plugin\u002Fvalues.yaml).\n\n#### Passing configuration to the plugin via a `ConfigMap`\n\nIn general, we provide a mechanism to pass _multiple_ configuration files to\nthe plugin's `helm` chart, with the ability to choose which configuration\nfile should be applied to a node via a node label.\n\nIn this way, a single chart can be used to deploy each component, but custom\nconfigurations can be applied to different nodes throughout the cluster.\n\nThere are two ways to provide a `ConfigMap` for use by the plugin:\n\n  1. Via an external reference to a pre-defined `ConfigMap`\n  1. 
As a set of named config files to build an integrated `ConfigMap` associated with the chart\n\nThese can be set via the chart values `config.name` and `config.map` respectively.\nIn both cases, the value `config.default` can be set to point to one of the\nnamed configs in the `ConfigMap` and provide a default configuration for nodes\nthat have not been customized via a node label (more on this later).\n\n##### Single Config File Example\n\nAs an example, create a valid config file on your local filesystem, such as the following:\n\n```shell\ncat \u003C\u003C EOF > \u002Ftmp\u002Fdp-example-config0.yaml\nversion: v1\nflags:\n  migStrategy: \"none\"\n  failOnInitError: true\n  nvidiaDriverRoot: \"\u002F\"\n  plugin:\n    passDeviceSpecs: false\n    deviceListStrategy: envvar\n    deviceIDStrategy: uuid\nEOF\n```\n\nAnd deploy the device plugin via helm (pointing it at this config file and giving it a name):\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set-file config.map.config=\u002Ftmp\u002Fdp-example-config0.yaml\n```\n\nUnder the hood this will deploy a `ConfigMap` associated with the plugin and put\nthe contents of the `dp-example-config0.yaml` file into it, using the name\n`config` as its key. It will then start the plugin such that this config gets\napplied when the plugin comes online.\n\nIf you don’t want the plugin’s helm chart to create the `ConfigMap` for you, you\ncan also point it at a pre-created `ConfigMap` as follows:\n\n```shell\nkubectl create ns nvidia-device-plugin\n```\n\n```shell\nkubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \\\n  --from-file=config=\u002Ftmp\u002Fdp-example-config0.yaml\n```\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set config.name=nvidia-plugin-configs\n```\n\n##### Multiple Config File Example\n\nFor multiple config files, the procedure is similar.\n\nCreate a second `config` file with the following contents:\n\n```shell\ncat \u003C\u003C EOF > \u002Ftmp\u002Fdp-example-config1.yaml\nversion: v1\nflags:\n  migStrategy: \"mixed\" # Only change from config0.yaml\n  failOnInitError: true\n  nvidiaDriverRoot: \"\u002F\"\n  plugin:\n    passDeviceSpecs: false\n    deviceListStrategy: envvar\n    deviceIDStrategy: uuid\nEOF\n```\n\nAnd redeploy the device plugin via helm (pointing it at both configs with a specified default).\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set config.default=config0 \\\n  --set-file config.map.config0=\u002Ftmp\u002Fdp-example-config0.yaml \\\n  --set-file config.map.config1=\u002Ftmp\u002Fdp-example-config1.yaml\n```\n\nAs before, this can also be done with a pre-created `ConfigMap` if desired:\n\n```shell\nkubectl create ns nvidia-device-plugin\n```\n\n```shell\nkubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \\\n  --from-file=config0=\u002Ftmp\u002Fdp-example-config0.yaml \\\n  --from-file=config1=\u002Ftmp\u002Fdp-example-config1.yaml\n```\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set config.default=config0 \\\n  --set config.name=nvidia-plugin-configs\n```\n\n**Note:** If the `config.default` flag is not explicitly set, then a 
default\nvalue will be inferred from the config if one of the config names is set to\n'`default`'. If neither of these is set, then the deployment will fail unless\nthere is only **_one_** config provided. In the case of just a single config being\nprovided, it will be chosen as the default because there is no other option.\n\n##### Updating Per-Node Configuration With a Node Label\n\nWith this setup, plugins on all nodes will have `config0` configured for them\nby default. However, the following label can be set to change which\nconfiguration is applied:\n\n```shell\nkubectl label nodes \u003Cnode-name> --overwrite \\\n  nvidia.com\u002Fdevice-plugin.config=\u003Cconfig-name>\n```\n\nFor example, applying a custom config for all nodes that have T4 GPUs installed\non them might be:\n\n```shell\nkubectl label node \\\n  --overwrite \\\n  --selector=nvidia.com\u002Fgpu.product=TESLA-T4 \\\n  nvidia.com\u002Fdevice-plugin.config=t4-config\n```\n\n**Note:** This label can be applied either _before_ or _after_ the plugin is\nstarted to get the desired configuration applied on the node. Anytime it\nchanges value, the plugin will immediately be updated to start serving the\ndesired configuration. If it is set to an unknown value, it will skip\nreconfiguration. If it is ever unset, it will fall back to the default.\n\n#### Setting other helm chart values\n\nAs mentioned previously, the device plugin's helm chart continues to provide\ndirect values to set the configuration options of the plugin without using a\n`ConfigMap`. These should only be used to set globally applicable options\n(which should then never be embedded in the set of config files provided by the\n`ConfigMap`), or used to override these options as desired.\n\nThese values are as follows:\n\n```yaml\n  migStrategy:\n      the desired strategy for exposing MIG devices on GPUs that support it\n      [none | single | mixed] (default \"none\")\n  failOnInitError:\n      fail the plugin if an error is encountered during initialization, otherwise block indefinitely\n      (default 'true')\n  compatWithCPUManager:\n      run with escalated privileges to be compatible with the static CPUManager policy\n      (default 'false')\n  deviceListStrategy:\n      the desired strategy for passing the device list to the underlying runtime\n      [envvar | volume-mounts | cdi-annotations | cdi-cri] (default \"envvar\")\n  deviceIDStrategy:\n      the desired strategy for passing device IDs to the underlying runtime\n      [uuid | index] (default \"uuid\")\n  nvidiaDriverRoot:\n      the root path for the NVIDIA driver installation (typical values are '\u002F' or '\u002Frun\u002Fnvidia\u002Fdriver')\n```\n\n**Note:** There is no value that directly maps to the `PASS_DEVICE_SPECS`\nconfiguration option of the plugin. Instead, a value called\n`compatWithCPUManager` is provided which acts as a proxy for this option.\nIt both sets the `PASS_DEVICE_SPECS` option of the plugin to true **AND** makes\nsure that the plugin is started with elevated privileges to ensure proper\ncompatibility with the `CPUManager`.\n\nBesides these custom configuration options for the plugin, other standard helm\nchart values that are commonly overridden are:\n\n```yaml\nruntimeClassName:\n  the runtimeClassName to use, for use with clusters that have multiple runtimes. 
(typical value is 'nvidia')\n```\n\nPlease take a look in the\n[`values.yaml`](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fblob\u002Fv0.17.1\u002Fdeployments\u002Fhelm\u002Fnvidia-device-plugin\u002Fvalues.yaml)\nfile to see the full set of overridable parameters for the device plugin.\n\nExamples of setting these options include:\n\nEnabling compatibility with the `CPUManager` and running with a request for\n100ms of CPU time and a limit of 512MB of memory.\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set compatWithCPUManager=true \\\n  --set resources.requests.cpu=100m \\\n  --set resources.limits.memory=512Mi\n```\n\nEnabling compatibility with the `CPUManager` and the `mixed` `migStrategy`.\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set compatWithCPUManager=true \\\n  --set migStrategy=mixed\n```\n\n#### Deploying with gpu-feature-discovery for automatic node labels\n\nAs of `v0.12.0`, the device plugin's helm chart has integrated support to\ndeploy\n[`gpu-feature-discovery`](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fgpu-feature-discovery)\n(GFD). You can use GFD to automatically generate labels for the\nset of GPUs available on a node. Under the hood, it leverages [Node Feature Discovery](https:\u002F\u002Fkubernetes-sigs.github.io\u002Fnode-feature-discovery\u002Fstable\u002Fget-started\u002Findex.html) to perform this labeling.\n\nTo enable it, simply set `gfd.enabled=true` during helm install.\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set gfd.enabled=true\n```\n\nUnder the hood this will also deploy\n[`node-feature-discovery`](https:\u002F\u002Fgithub.com\u002Fkubernetes-sigs\u002Fnode-feature-discovery)\n(NFD) since it is a prerequisite of GFD. If you already have NFD deployed on\nyour cluster and do not wish for it to be pulled in by this installation, you\ncan disable it with `nfd.enabled=false`.\n\nIn addition to the standard node labels applied by GFD, the following label\nwill also be included when deploying the plugin with the time-slicing or MPS extensions\ndescribed [above](#shared-access-to-gpus).\n\n```\nnvidia.com\u002F\u003Cresource-name>.replicas = \u003Cnum-replicas>\n```\n\nAdditionally, the `nvidia.com\u002F\u003Cresource-name>.product` will be modified as follows if\n`renameByDefault=false`.\n\n```\nnvidia.com\u002F\u003Cresource-name>.product = \u003Cproduct name>-SHARED\n```\n\nUsing these labels, users have a way of selecting a shared vs. non-shared GPU\nin the same way they would traditionally select one GPU model over another.\nThat is, the `SHARED` annotation ensures that a `nodeSelector` can be used to\nattract pods to nodes that have shared GPUs on them.\n\nSince having `renameByDefault=true` already encodes the fact that the resource is\nshared on the resource name, there is no need to annotate the product\nname with `SHARED`. 
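\n\nAs an illustrative sketch (not taken from this README) of how a pod might use these labels, the following spec targets a node that advertises shared GPUs via the `-SHARED` product suffix (the `renameByDefault=false` case); the product value is hypothetical, and the image is the sample from the [Running GPU Jobs](#running-gpu-jobs) section:\n\n```yaml\n# Hypothetical pod spec: land on a node whose GPUs are shared (time-slicing or MPS)\n# and request a single nvidia.com\u002Fgpu, i.e. one of the advertised replicas.\napiVersion: v1\nkind: Pod\nmetadata:\n  name: shared-gpu-pod\nspec:\n  restartPolicy: Never\n  nodeSelector:\n    nvidia.com\u002Fgpu.product: TESLA-T4-SHARED   # illustrative product name\n  containers:\n    - name: cuda-container\n      image: nvcr.io\u002Fnvidia\u002Fk8s\u002Fcuda-sample:vectoradd-cuda12.5.0\n      resources:\n        limits:\n          nvidia.com\u002Fgpu: 1\n```\n\nWith `renameByDefault=true` the `nodeSelector` is not needed, and the limit would instead name `nvidia.com\u002Fgpu.shared`. 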
Users can already find the shared resources they need by\nsimply requesting it in their pod spec.\n\nNote: When running with `renameByDefault=false` and `migStrategy=single`, both\nthe MIG profile name and the new `SHARED` annotation will be appended to the\nproduct name, e.g.:\n\n```\nnvidia.com\u002Fgpu.product = A100-SXM4-40GB-MIG-1g.5gb-SHARED\n```\n\n#### Deploying gpu-feature-discovery in standalone mode\n\nAs of v0.15.0, the device plugin's helm chart has integrated support to deploy\n[`gpu-feature-discovery`](\u002Fdocs\u002Fgpu-feature-discovery\u002FREADME.md#overview).\n\nTo deploy gpu-feature-discovery in standalone mode, begin by setting up the\nplugin's `helm` repository and updating it as follows:\n\n```shell\nhelm repo add nvdp https:\u002F\u002Fnvidia.github.io\u002Fk8s-device-plugin\nhelm repo update\n```\n\nThen verify that the latest release (`v0.17.1`) of the plugin is available\n(note that this includes the GFD chart):\n\n```shell\nhelm search repo nvdp --devel\nNAME                     \t  CHART VERSION  APP VERSION\tDESCRIPTION\nnvdp\u002Fnvidia-device-plugin\t  0.17.1\t 0.17.1\t\tA Helm chart for ...\n```\n\nOnce this repo is updated, you can begin installing packages from it to deploy\nthe `gpu-feature-discovery` component in standalone mode.\n\nThe most basic installation command without any options is then:\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version 0.17.1 \\\n  --namespace gpu-feature-discovery \\\n  --create-namespace \\\n  --set devicePlugin.enabled=false\n```\n\nDisabling auto-deployment of NFD and running with a MIG strategy of 'mixed' in\nthe default namespace:\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n    --version=0.17.1 \\\n    --set allowDefaultNamespace=true \\\n    --set nfd.enabled=false \\\n    --set migStrategy=mixed \\\n    --set devicePlugin.enabled=false\n```\n\n**Note:** You only need to pass the `--devel` flag to `helm search repo`\nand the `--version` flag to `helm upgrade -i` if this is a pre-release\nversion (e.g. `\u003Cversion>-rc.1`). Full releases will be listed without this.\n\n### Deploying via `helm install` with a direct URL to the `helm` package\n\nIf you prefer not to install from the `nvidia-device-plugin` `helm` repo, you can\nrun `helm install` directly against the tarball of the plugin's `helm` package.\nThe example below installs the same chart as the method above, except that\nit uses a direct URL to the `helm` chart instead of via the `helm` repo.\n\nUsing the default values for the flags:\n\n```shell\nhelm upgrade -i nvdp \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  https:\u002F\u002Fnvidia.github.io\u002Fk8s-device-plugin\u002Fstable\u002Fnvidia-device-plugin-0.17.1.tgz\n```\n\n## Building and Running Locally\n\nThe next sections are focused on building the device plugin locally and running it.\nThis is intended purely for development and testing, and not required by most users.\nIt assumes you are pinning to the latest release tag (i.e. 
`v0.17.1`), but can\neasily be modified to work with any available tag or branch.\n\n### With Docker\n\n#### Build\n\nOption 1, pull the prebuilt image from [Docker Hub](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fnvidia\u002Fk8s-device-plugin):\n\n```shell\ndocker pull nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:v0.17.1\ndocker tag nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:v0.17.1 nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:devel\n```\n\nOption 2, build without cloning the repository:\n\n```shell\ndocker build \\\n  -t nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:devel \\\n  -f deployments\u002Fcontainer\u002FDockerfile.ubuntu \\\n  https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin.git#v0.17.1\n```\n\nOption 3, if you want to modify the code:\n\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin.git && cd k8s-device-plugin\ndocker build \\\n  -t nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:devel \\\n  -f deployments\u002Fcontainer\u002FDockerfile.ubuntu \\\n  .\n```\n\n#### Run\n\nWithout compatibility for the `CPUManager` static policy:\n\n```shell\ndocker run \\\n  -it \\\n  --security-opt=no-new-privileges \\\n  --cap-drop=ALL \\\n  --network=none \\\n  -v \u002Fvar\u002Flib\u002Fkubelet\u002Fdevice-plugins:\u002Fvar\u002Flib\u002Fkubelet\u002Fdevice-plugins \\\n  nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:devel\n```\n\nWith compatibility for the `CPUManager` static policy:\n\n```shell\ndocker run \\\n  -it \\\n  --privileged \\\n  --network=none \\\n  -v \u002Fvar\u002Flib\u002Fkubelet\u002Fdevice-plugins:\u002Fvar\u002Flib\u002Fkubelet\u002Fdevice-plugins \\\n  nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:devel --pass-device-specs\n```\n\n### Without Docker\n\n#### Build\n\n```shell\nC_INCLUDE_PATH=\u002Fusr\u002Flocal\u002Fcuda\u002Finclude LIBRARY_PATH=\u002Fusr\u002Flocal\u002Fcuda\u002Flib64 go build\n```\n\n#### Run\n\nWithout compatibility for the `CPUManager` static policy:\n\n```shell\n.\u002Fk8s-device-plugin\n```\n\nWith compatibility for the `CPUManager` static policy:\n\n```shell\n.\u002Fk8s-device-plugin --pass-device-specs\n```\n\n## Changelog\n\nSee the [changelog](CHANGELOG.md)\n\n## Issues and Contributing\n\n[Checkout the Contributing document!](CONTRIBUTING.md)\n\n- You can report a bug by [filing a new issue](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fissues\u002Fnew)\n- You can contribute by opening a [pull request](https:\u002F\u002Fhelp.github.com\u002Farticles\u002Fusing-pull-requests\u002F)\n\n### Versioning\n\nBefore v1.10 the versioning scheme of the device plugin had to match exactly the version of Kubernetes.\nAfter the promotion of device plugins to beta this condition was was no longer required.\nWe quickly noticed that this versioning scheme was very confusing for users as they still expected to see\na version of the device plugin for each version of Kubernetes.\n\nThis versioning scheme applies to the tags `v1.8`, `v1.9`, `v1.10`, `v1.11`, `v1.12`.\n\nWe have now changed the versioning to follow [SEMVER](https:\u002F\u002Fsemver.org\u002F). The\nfirst version following this scheme has been tagged `v0.0.0`.\n\nGoing forward, the major version of the device plugin will only change\nfollowing a change in the device plugin API itself. For example, version\n`v1beta1` of the device plugin API corresponds to version `v0.x.x` of the\ndevice plugin. 
If a new `v2beta2` version of the device plugin API comes out,\nthen the device plugin will increase its major version to `1.x.x`.\n\nAs of now, the device plugin API for Kubernetes >= v1.10 is `v1beta1`. If you\nhave a version of Kubernetes >= 1.10, you can deploy any device plugin version >\n`v0.0.0`.\n\n### Upgrading Kubernetes with the Device Plugin\n\nUpgrading Kubernetes when you have a device plugin deployed doesn't require you\nto make any particular changes to your workflow. The API is versioned and is\npretty stable (though it is not guaranteed to be non-breaking). Starting with\nKubernetes version 1.10, you can use `v0.3.0` of the device plugin to perform\nupgrades, and Kubernetes won't require you to deploy a different version of the\ndevice plugin. Once a node comes back online after the upgrade, you will see\nGPUs re-registering themselves automatically.\n\nUpgrading the device plugin itself is a more complex task. It is recommended to\ndrain GPU tasks as we cannot guarantee that GPU tasks will survive a rolling\nupgrade. However, we make best efforts to preserve GPU tasks during an upgrade.\n",
[运行](#run)\n  - [不使用 Docker](#without-docker)\n    - [构建](#build-1)\n    - [运行](#run-1)\n- [变更日志](#changelog)\n- [问题与贡献](#issues-and-contributing)\n  - [版本控制](#versioning)\n  - [使用设备插件升级 Kubernetes](#upgrading-kubernetes-with-the-device-plugin)\n\n## 关于\n\n适用于 Kubernetes 的 NVIDIA 设备插件是一个 DaemonSet，它允许您自动：\n\n- 暴露集群中每个节点上的 GPU 数量\n- 跟踪 GPU 的健康状况\n- 在您的 Kubernetes 集群中运行启用了 GPU 的容器。\n\n此仓库包含 NVIDIA 对 [Kubernetes 设备插件](https:\u002F\u002Fkubernetes.io\u002Fdocs\u002Fconcepts\u002Fextend-kubernetes\u002Fcompute-storage-net\u002Fdevice-plugins\u002F) 的官方实现。\n自 v0.15.0 版本起，该仓库还包含了 GPU 功能发现标签的实现，\n有关 GPU 功能发现的更多信息，请参阅 [此处](docs\u002Fgpu-feature-discovery\u002FREADME.md)。\n\n请注意以下几点：\n\n- NVIDIA 设备插件 API 自 Kubernetes v1.10 起处于测试阶段。\n- 当前 NVIDIA 设备插件尚缺乏：\n  - 全面的 GPU 健康检查功能\n  - GPU 清理功能\n- 仅对官方 NVIDIA 设备插件提供支持（而不包括其分支或其他变体）。\n\n## 先决条件\n\n运行 NVIDIA 设备插件所需的先决条件如下：\n\n- NVIDIA 驱动程序 ~= 384.81\n- nvidia-docker >= 2.0 或 nvidia-container-toolkit >= 1.7.0（若要在基于 Tegra 的系统上使用集成 GPU，则需 >= 1.11.0）\n- 将 nvidia-container-runtime 配置为默认的底层运行时\n- Kubernetes 版本 >= 1.10\n\n## 快速入门\n\n### 准备您的 GPU 节点\n\n以下步骤需要在所有 GPU 节点上执行。  \n本 README 假设 NVIDIA 驱动程序和 `nvidia-container-toolkit` 已经预先安装完毕，并且您已将 `nvidia-container-runtime` 配置为默认的底层运行时。\n\n请参阅：https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Finstall-guide.html\n\n#### 基于 Debian 的系统，使用 `docker` 和 `containerd` 的示例\n\n##### 安装 NVIDIA Container Toolkit\n\n有关安装和开始使用 NVIDIA Container Toolkit 的说明，请参阅[安装指南](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Finstall-guide.html#installation-guide)。\n\n同时请注意针对以下运行时的配置说明：\n\n- [`containerd`](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html#configuring-containerd-for-kubernetes)\n- [`CRI-O`](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html#configuring-cri-o)\n- [`docker`（已弃用）](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html#configuring-docker)\n\n请记住，在应用配置更改后，需重启每个运行时。\n\n如果应将 `nvidia` 运行时代理设置为默认运行时（例如对于非 CRI 的 Docker 版本），则必须在上述命令中包含 `--set-as-default` 参数。若未执行此操作，则需要定义一个 RuntimeClass：\n\n```yaml\napiVersion: node.k8s.io\u002Fv1\nkind: RuntimeClass\nmetadata:\n  name: nvidia\nhandler: nvidia\n```\n\n##### 关于 `CRI-O` 配置的注意事项\n\n当使用 `CRI-O` 运行 `kubernetes` 时，需添加配置文件以将 `nvidia-container-runtime` 设置为默认的底层 OCI 运行时，文件路径为 `\u002Fetc\u002Fcrio\u002Fcrio.conf.d\u002F99-nvidia.conf`。该文件将优先于默认的 `crun` 配置文件 `\u002Fetc\u002Fcrio\u002Fcrio.conf.d\u002F10-crun.conf`：\n\n```toml\n[crio]\n\n  [crio.runtime]\n    default_runtime = \"nvidia\"\n\n    [crio.runtime.runtimes]\n\n      [crio.runtime.runtimes.nvidia]\n        runtime_path = \"\u002Fusr\u002Fbin\u002Fnvidia-container-runtime\"\n        runtime_type = \"oci\"\n```\n\n如链接文档所述，此文件可使用 nvidia-ctk 命令自动生成：\n\n```shell\nsudo nvidia-ctk runtime configure --runtime=crio --set-as-default --config=\u002Fetc\u002Fcrio\u002Fcrio.conf.d\u002F99-nvidia.conf\n```\n\n由于 `CRI-O` 默认使用 `crun` 作为底层 OCI 运行时，因此需要在 `\u002Fetc\u002Fnvidia-container-runtime\u002Fconfig.toml` 文件中将 `crun` 添加到 `nvidia-container-runtime` 的运行时列表中：\n\n```toml\n[nvidia-container-runtime]\nruntimes = [\"crun\", \"docker-runc\", \"runc\"]\n```\n\n然后重启 `CRI-O`：\n\n```shell\nsudo systemctl restart crio\n```\n\n### 在 Kubernetes 中启用 GPU 支持\n\n在集群中的所有 GPU 节点上完成上述配置后，可以通过部署以下 DaemonSet 来启用 GPU 支持：\n\n```shell\nkubectl create -f 
https:\u002F\u002Fraw.githubusercontent.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fv0.17.1\u002Fdeployments\u002Fstatic\u002Fnvidia-device-plugin.yml\n```\n\n**注意：** 这是一个简单的静态 DaemonSet，旨在演示 `nvidia-device-plugin` 的基本功能。在生产环境中部署插件时，请参阅下方关于[通过 Helm 部署](#deployment-via-helm)的说明。\n\n### 运行 GPU 作业\n\n部署 DaemonSet 后，容器现在可以使用 `nvidia.com\u002Fgpu` 资源类型来请求 NVIDIA GPU：\n\n```shell\ncat \u003C\u003CEOF | kubectl apply -f -\napiVersion: v1\nkind: Pod\nmetadata:\n  name: gpu-pod\nspec:\n  restartPolicy: Never\n  containers:\n    - name: cuda-container\n      image: nvcr.io\u002Fnvidia\u002Fk8s\u002Fcuda-sample:vectoradd-cuda12.5.0\n      resources:\n        limits:\n          nvidia.com\u002Fgpu: 1 # 请求 1 个 GPU\n  tolerations:\n  - key: nvidia.com\u002Fgpu\n    operator: Exists\n    effect: NoSchedule\nEOF\n```\n\n```shell\n$ kubectl logs gpu-pod\n[向量加法，共 50000 个元素]\n将输入数据从主机内存复制到 CUDA 设备\nCUDA 内核启动，包含 196 个线程块，每块 256 个线程\n将输出数据从 CUDA 设备复制回主机内存\n测试通过\n完成\n```\n\n> [!WARNING]  \n> 如果在使用设备插件时未请求 GPU，插件会将机器上的所有 GPU 暴露给容器。\n\n## 配置 NVIDIA 设备插件二进制文件\n\nNVIDIA 设备插件具有多个可配置选项。这些选项可以通过命令行标志、环境变量或在启动设备插件时使用配置文件进行配置。下面我们将解释每个选项的具体含义以及如何直接对插件二进制文件进行配置。接下来的部分将说明如何通过 `helm` 部署插件时设置这些配置。\n\n### 作为命令行标志或环境变量\n\n| 标志                     | 环境变量    | 默认值   |\n|--------------------------|-------------|----------|\n| `--mig-strategy`         | `$MIG_STRATEGY`         | `\"none\"`        |\n| `--fail-on-init-error`   | `$FAIL_ON_INIT_ERROR`   | `true`          |\n| `--nvidia-driver-root`   | `$NVIDIA_DRIVER_ROOT`   | `\"\u002F\"`           |\n| `--pass-device-specs`    | `$PASS_DEVICE_SPECS`    | `false`         |\n| `--device-list-strategy` | `$DEVICE_LIST_STRATEGY` | `\"envvar\"`      |\n| `--device-id-strategy`   | `$DEVICE_ID_STRATEGY`   | `\"uuid\"`        |\n| `--config-file`          | `$CONFIG_FILE`          | `\"\"`            |\n\n### 作为配置文件\n\n```yaml\nversion: v1\nflags:\n  migStrategy: \"none\"\n  failOnInitError: true\n  nvidiaDriverRoot: \"\u002F\"\n  plugin:\n    passDeviceSpecs: false\n    deviceListStrategy: \"envvar\"\n    deviceIDStrategy: \"uuid\"\n```\n\n**注意：** 配置文件中有一个明确的 `plugin` 部分，因为它是设备插件与 [`gpu-feature-discovery`](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fgpu-feature-discovery) 共享的配置。`plugin` 部分内的所有选项仅适用于设备插件，而该部分之外的选项则是共享的。\n\n### 配置选项详情\n\n**`MIG_STRATEGY`**:\n  用于在支持 MIG 的 GPU 上暴露 MIG 设备的所需策略\n\n  `[none | single | mixed] (默认 'none')`\n\n  `MIG_STRATEGY` 选项配置守护进程集，使其能够在支持 MIG 的 GPU 上暴露多实例 GPU (MIG)。有关这些策略的具体内容及其使用方式的更多信息，请参阅 [在 Kubernetes 中支持多实例 GPU (MIG)](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1mdgMQ8g7WmaI_XVVRrCvHPFPOMCm5LQD5JefgAh6N8g)。\n\n  **注意：** 当 `MIG_STRATEGY` 设置为 `mixed` 时，您将获得额外的资源，格式为 `nvidia.com\u002Fmig-\u003Cslice_count>g.\u003Cmemory_size>gb`，您可以在 Pod 规范中设置这些资源，以访问特定的 MIG 设备。\n\n**`FAIL_ON_INIT_ERROR`**:\n  如果初始化过程中遇到错误，则使插件失败；否则无限期阻塞\n\n  `(默认 'true')`\n\n  当设置为 `true` 时，如果初始化过程中遇到错误，`FAIL_ON_INIT_ERROR` 选项会使插件失败。当设置为 `false` 时，它会打印错误信息，并使插件无限期阻塞，而不是失败。无限期阻塞遵循旧版语义，允许插件成功部署到没有 GPU 的节点上（且本不应配备 GPU 的节点），而不会抛出错误。这样，您可以盲目地将带有该插件的守护进程集部署到集群中的所有节点，无论它们是否配备 GPU，都不会遇到错误。然而，这样做意味着无法检测到本应配备 GPU 的节点上是否存在实际错误。现在，默认行为是在初始化错误时失败，所有新部署都应采用此设置。\n\n**`NVIDIA_DRIVER_ROOT`**:\n  NVIDIA 驱动程序安装的根路径\n\n  `(默认 '\u002F')`\n\n  当 NVIDIA 驱动程序直接安装在主机上时，应将其设置为 `'\u002F'`。当驱动程序安装在其他位置时（例如通过驱动容器），则应将其设置为驱动程序安装的根文件系统路径（例如 `'\u002Frun\u002Fnvidia\u002Fdriver'`）。\n\n  **注意：** 此选项仅在与下面描述的 `$PASS_DEVICE_SPECS` 选项结合使用时才需要。它指示插件在作为设备规格返回的任何设备文件路径前添加哪个前缀。\n\n**`PASS_DEVICE_SPECS`**:\n  将分配给容器的任何 NVIDIA 设备的路径和所需的设备节点权限传递给插件\n\n  `(默认 'false')`\n\n  此选项的存在仅是为了使设备插件能够与 Kubernetes 中的 
`CPUManager` 互操作。设置此标志还需要以提升的权限部署守护进程集，因此只有在确实需要与 `CPUManager` 互操作时才应启用。\n\n**`DEVICE_LIST_STRATEGY`**:\n  将设备列表传递给底层运行时的所需策略\n\n  `[envvar | volume-mounts | cdi-annotations | cdi-cri ] (默认 'envvar')`\n\n  **注意**：可以指定多种设备列表策略（以逗号分隔的列表形式）。\n\n  `DEVICE_LIST_STRATEGY` 标志允许用户选择插件用于通告分配给容器的 GPU 列表所采用的策略。可能的值包括：\n\n  - `envvar`（默认）：使用 [此处](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fnvidia-container-runtime#nvidia_visible_devices) 描述的 `NVIDIA_VISIBLE_DEVICES` 环境变量来选择由 NVIDIA 容器运行时注入的设备。\n  - `volume-mounts`：将设备列表作为一组卷挂载传递，而不是作为环境变量传递，以指示 NVIDIA 容器运行时注入这些设备。有关此策略背后的原因的详细信息，请参阅 [此处](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw\u002Fedit#heading=h.b3ti65rojfy5)。\n  - `cdi-annotations`：使用 CDI 注解来选择要注入的设备。请注意，这不需要 NVIDIA 容器运行时，但需要支持 CDI 的容器引擎。\n  - `cdi-cri`：使用 CRI 字段 `CDIDevices` 来选择要注入的 CDI 设备。这需要 Kubernetes 支持将这些请求通过 CRI 转发到支持 CDI 的容器引擎。\n\n**`DEVICE_ID_STRATEGY`**:\n  将设备 ID 传递给底层运行时的所需策略\n\n  `[uuid | index] (默认 'uuid')`\n\n  `DEVICE_ID_STRATEGY` 标志允许用户选择插件用于传递分配给容器的 GPU 设备 ID 所采用的策略。传统上，设备 ID 是以 GPU 的 UUID 形式传递的。此标志允许用户决定是使用 GPU 的 UUID 还是 GPU 的索引（如 `nvidia-smi` 输出所示）作为传递给底层运行时的标识符。在某些情况下，例如当由插件分配了 GPU 的 Pod 重启后连接到不同的物理 GPU 时，使用索引可能是更合适的选择。\n\n**`CONFIG_FILE`**:\n  让插件指向一个配置文件，而不是依赖命令行参数或环境变量\n\n  `(默认 '')`\n\n  每个选项的优先级顺序是：(1) 命令行参数，(2) 环境变量，(3) 配置文件。这样，用户可以使用预定义的配置文件，然后在启动时覆盖其中的值。如下所述，可以通过 `helm` 部署时使用 `ConfigMap` 将插件指向所需的配置文件。\n\n### GPU 共享访问\n\nNVIDIA 设备插件可通过其配置文件中的一组扩展选项实现 GPU 的超量分配。有两种共享模式可供选择：时间片轮转和 MPS。\n\n> [!NOTE]\n> 时间片轮转和 MPS 是互斥的。\n\n在时间片轮转模式下，CUDA 时间片技术用于让共享同一 GPU 的工作负载相互交错执行。然而，对于从同一 GPU 分配副本的工作负载，并未采取特殊隔离措施，每个工作负载都可以访问 GPU 内存，并且与所有其他工作负载处于相同的故障域中（这意味着如果其中一个工作负载崩溃，所有工作负载都会崩溃）。\n\n在 MPS 模式下，使用一个控制守护进程来管理对共享 GPU 的访问。与时间片轮转不同，MPS 实施空间分区，允许显式划分内存和计算资源，并对每个工作负载强制实施这些限制。\n\n无论是时间片轮转还是 MPS，同一种共享方法都会应用于节点上的所有 GPU。您无法按单个 GPU 配置共享策略。\n\n#### 使用 CUDA 时间片轮转\n\n使用时间片轮转进行共享的扩展选项如下所示：\n\n```yaml\nversion: v1\nsharing:\n  timeSlicing:\n    renameByDefault: \u003Cbool>\n    failRequestsGreaterThanOne: \u003Cbool>\n    resources:\n    - name: \u003Cresource-name>\n      replicas: \u003Cnum-replicas>\n    ...\n```\n\n也就是说，对于 `sharing.timeSlicing.resources` 下的每一种命名资源，现在可以为该资源类型指定副本数。这些副本代表将为该资源类型所表示的 GPU 分配的共享访问次数。\n\n如果 `renameByDefault=true`，那么每种资源将以 `\u003Cresource-name>.shared` 的名称进行通告，而不是简单地使用 `\u003Cresource-name>`。\n\n如果 `failRequestsGreaterThanOne=true`，则当容器请求超过一个共享资源时，插件会拒绝为其分配任何共享资源。该容器的 Pod 将因 `UnexpectedAdmissionError` 而失败，需要手动删除、更新并重新部署。\n\n例如：\n\n```yaml\nversion: v1\nsharing:\n  timeSlicing:\n    resources:\n    - name: nvidia.com\u002Fgpu\n      replicas: 10\n```\n\n如果将此配置应用于一台拥有 8 块 GPU 的节点，插件现在将向 Kubernetes 宣告 80 个 `nvidia.com\u002Fgpu` 资源，而不是 8 个。\n\n```shell\n$ kubectl describe node\n...\nCapacity:\n  nvidia.com\u002Fgpu: 80\n...\n```\n\n同样，如果应用以下配置到节点上，则会向 Kubernetes 宣告 80 个 `nvidia.com\u002Fgpu.shared` 资源，而不是 8 个 `nvidia.com\u002Fgpu` 资源。\n\n```yaml\nversion: v1\nsharing:\n  timeSlicing:\n    renameByDefault: true\n    resources:\n    - name: nvidia.com\u002Fgpu\n      replicas: 10\n    ...\n```\n\n```shell\n$ kubectl describe node\n...\nCapacity:\n  nvidia.com\u002Fgpu.shared: 80\n...\n```\n\n在这两种情况下，插件都会简单地为每块 GPU 创建 10 个引用，并无差别地将其分配给所有提出请求的用户。\n\n如果在上述任一配置中设置了 `failRequestsGreaterThanOne=true`，而用户在其 Pod 规范中请求超过一个 `nvidia.com\u002Fgpu` 或 `nvidia.com\u002Fgpu.shared` 资源，则该容器将失败，并出现如下错误：\n\n```shell\n$ kubectl describe pod gpu-pod\n...\nEvents:\n  Type     Reason                    Age   From               Message\n  ----     ------                    ----  ----               -------\n  Warning  
UnexpectedAdmissionError  13s   kubelet            Allocate failed due to rpc error: code = Unknown desc = request for 'nvidia.com\u002Fgpu: 2' too large: maximum request size for shared resources is 1, which is unexpected\n...\n```\n\n**注意：** 与“普通”GPU 请求不同，请求多个共享GPU 并不意味着您将获得按比例分配的计算能力。它仅仅表示您将获得与其他客户端共享的 GPU（每个客户端都可以在其底层 GPU 上自由运行任意数量的进程）。在底层，CUDA 会简单地为所有客户端的所有 GPU 进程平均分配时间片。`failRequestsGreaterThanOne` 标志旨在帮助用户理解这一微妙之处，通过将 `1` 的请求视为访问请求，而非独占资源请求。建议设置 `failRequestsGreaterThanOne=true`，但默认值为 `false`，以保持向后兼容性。\n\n截至目前，唯一支持时间切片的资源是 `nvidia.com\u002Fgpu`，以及通过采用混合 MIG 策略配置节点后产生的任何资源类型。\n\n例如，T4 卡上的所有可时间切片资源为：\n\n```\nnvidia.com\u002Fgpu\n```\n\n而 A100 40GB 卡上的所有可时间切片资源为：\n\n```\nnvidia.com\u002Fgpu\nnvidia.com\u002Fmig-1g.5gb\nnvidia.com\u002Fmig-2g.10gb\nnvidia.com\u002Fmig-3g.20gb\nnvidia.com\u002Fmig-7g.40gb\n```\n\n同样，在 A100 80GB 卡上，这些资源将是：\n\n```\nnvidia.com\u002Fgpu\nnvidia.com\u002Fmig-1g.10gb\nnvidia.com\u002Fmig-2g.20gb\nnvidia.com\u002Fmig-3g.40gb\nnvidia.com\u002Fmig-7g.80gb\n```\n\n#### 使用 CUDA MPS\n\n> [!WARNING]\n> 自设备插件 v0.15.0 版本起，MPS 支持被视为实验性功能。有关详细信息，请参阅 [发行说明](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Freleases\u002Ftag\u002Fv0.15.0)。\n\n> [!NOTE]\n> 目前，在启用了 MIG 的设备上不支持使用 MPS 进行资源共享。\n\n使用 MPS 进行资源共享的扩展选项如下所示：\n\n```yaml\nversion: v1\nsharing:\n  mps:\n    renameByDefault: \u003Cbool>\n    resources:\n    - name: \u003Cresource-name>\n      replicas: \u003Cnum-replicas>\n    ...\n```\n\n也就是说，对于 `sharing.mps.resources` 下的每一种命名资源，可以为该资源类型指定副本数。与时间切片类似，这些副本代表将为与该资源类型关联的 GPU 分配的共享访问次数。然而，与时间切片不同，每个客户端（即每个分区）允许使用的内存量由 MPS 控制守护进程管理，并被限制为总设备内存的相等份额。除了控制每个客户端可消耗的内存量外，MPS 控制守护进程还会限制客户端可消耗的计算能力。\n\n如果 `renameByDefault=true`，那么每种资源将以 `\u003Cresource-name>.shared` 的名称进行通告，而不是简单地使用 `\u003Cresource-name>`。\n\n例如：\n\n```yaml\nversion: v1\nsharing:\n  mps:\n    resources:\n    - name: nvidia.com\u002Fgpu\n      replicas: 10\n```\n\n如果将此配置应用于一台拥有 8 块 GPU 的节点，插件现在将向 Kubernetes 宣告 80 个 `nvidia.com\u002Fgpu` 资源，而不是 8 个。\n\n```shell\n$ kubectl describe node\n...\nCapacity:\n  nvidia.com\u002Fgpu: 80\n...\n```\n\n同样，如果应用以下配置到节点上，则会向 Kubernetes 宣告 80 个 `nvidia.com\u002Fgpu.shared` 资源，而不是 8 个 `nvidia.com\u002Fgpu` 资源。\n\n```yaml\nversion: v1\nsharing:\n  mps:\n    renameByDefault: true\n    resources:\n    - name: nvidia.com\u002Fgpu\n      replicas: 10\n    ...\n```\n\n```shell\n$ kubectl describe node\n...\nCapacity:\n  nvidia.com\u002Fgpu.shared: 80\n...\n```\n\n此外，这些资源——无论是 `nvidia.com\u002Fgpu` 还是 `nvidia.com\u002Fgpu.shared`——都将访问 GPU 总内存和计算资源的相同份额（1\u002F10）。\n\n**注意**：截至目前，MPS 仅支持 `nvidia.com\u002Fgpu` 资源，且仅限于完整的 GPU。\n\n### IMEX 支持\n\nNVIDIA GPU 设备插件可以配置为将 IMEX 通道注入到工作负载中。\n\n此可选行为是全局性的，会影响所有工作负载，并由 `imex.channelIDs` 和 `imex.required` 配置选项控制。\n\n| `imex.channelIDs` | `imex.required` | 效果 |\n|---|---|---|\n| `[]` | * | （默认）不会向工作负载请求添加任何 IMEX 通道。请注意，在这种情况下，`imex.required` 字段不起作用 |\n| `[0]` | `false` | 如果 NVIDIA GPU 设备插件能够发现所请求的 IMEX 通道（`0`），则该通道会被添加到每个工作负载请求中。如果无法发现该通道，则不会向工作负载请求添加任何通道。 |\n| `[0]` | `true` | 如果 NVIDIA GPU 设备插件能够发现所请求的 IMEX 通道（`0`），则该通道会被添加到每个工作负载请求中。如果无法发现该通道，由于该通道被标记为“必需”，将会引发错误。 |\n\n**注意**：目前有效的 `imex.channelIDs` 配置只有 `[]` 和 `[0]`。\n\n为了使运行中的容器化 NVIDIA GPU 设备插件能够成功发现可用的 IMEX 通道，相应的设备节点必须对容器可见。\n\n## 标签目录\n\nNVIDIA 设备插件会读取和写入多种标签，这些标签既可用作配置元素，也可用作信息性元素。下表记录并描述了每个标签及其用途。有关 GFD 添加的标签，请参阅相关表格 [此处](\u002Fdocs\u002Fgpu-feature-discovery\u002FREADME.md#generated-labels)。\n\n| 标签名                          | 描述                                                                                                                                                    
                                                                              | 示例        |\n| ----------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------- |\n| nvidia.com\u002Fdevice-plugin.config     | 指定应用于节点的配置。通过应用此标签可以进行节点级别的配置。有关详细信息，请参阅 [使用节点标签更新节点级配置](#updating-per-node-configuration-with-a-node-label)。 | my-mps-config  |\n| nvidia.com\u002Fgpu.sharing-strategy     | 指定共享策略。默认值为 `none`，表示不共享。其他值包括 `mps` 和 `time-slicing`。                                                                                                                 | time-slicing   |\n| nvidia.com\u002Fmig.capable              | 指定节点上的任何设备是否支持 MIG。                                                                                                                                                                                            | false          |\n| nvidia.com\u002Fmps.capable              | 指定节点上的设备是否已配置为 MPS。                                                                                                                                                                                     | false          |\n| nvidia.com\u002Fvgpu.present             | 指定节点上的设备是否使用 vGPU。                                                                                                                                                                                                   | false          |\n| nvidia.com\u002Fvgpu.host-driver-branch  | 指定底层虚拟机管理程序上的 vGPU 主机驱动程序分支。                                                                                                                                                                          | r550_40        |\n| nvidia.com\u002Fvgpu.host-driver-version | 指定底层虚拟机管理程序上的 vGPU 主机驱动程序版本。                                                                                                                                                                         | 550.54.16      |\n\n## 通过 `helm` 部署\n\n部署设备插件的首选方法是使用 `helm` 以守护进程集的形式进行部署。安装 `helm` 的说明可以在此处找到：\n[这里](https:\u002F\u002Fhelm.sh\u002Fdocs\u002Fintro\u002Finstall\u002F)。\n\n首先设置插件的 `helm` 仓库并按如下方式更新：\n\n```shell\nhelm repo add nvdp https:\u002F\u002Fnvidia.github.io\u002Fk8s-device-plugin\nhelm repo update\n```\n\n然后验证插件的最新版本（`v0.17.1`）是否可用：\n\n```shell\n$ helm search repo nvdp --devel\nNAME                     \t  CHART VERSION  APP VERSION\tDESCRIPTION\nnvdp\u002Fnvidia-device-plugin\t  0.17.1\t 0.17.1\t\tA Helm chart for ...\n```\n\n一旦该仓库更新完毕，您就可以开始从其中安装软件包来部署 `nvidia-device-plugin` Helm 图表。\n\n最基本的安装命令（不带任何选项）如下：\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --version 0.17.1\n```\n\n**注意**：只有当这是预发布版本（例如 `\u003Cversion>-rc.1`）时，才需要在 `helm search repo` 中传递 `--devel` 标志，以及在 `helm upgrade -i` 中传递 `--version` 标志。正式发布的版本将不会显示此标志。\n\n### 配置设备插件的 `helm` 图表\n\n适用于插件最新版本（`v0.17.1`）的 `helm` 图表包含许多可自定义的值。\n\n在 `v0.12.0` 之前，最常用的值是与插件二进制文件的命令行选项直接对应的那些。自 `v0.12.0` 起，设置这些选项的首选方法是通过 `ConfigMap`。原始值的主要用途是在需要时覆盖 `ConfigMap` 中的某个选项。以下将更详细地讨论这两种方法。\n\n可设置的完整值列表请见：\n[此处](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fblob\u002Fv0.17.1\u002Fdeployments\u002Fhelm\u002Fnvidia-device-plugin\u002Fvalues.yaml)。\n\n#### 通过 `ConfigMap` 向插件传递配置\n\n通常，我们提供一种机制，可以将 _多个_ 配置文件传递给插件的 `helm` 
图表，并且可以通过节点标签选择应将哪个配置文件应用于特定节点。\n\n这样，可以使用单个图表部署各个组件，但可以在集群中的不同节点上应用自定义配置。\n\n有两种方法可以为插件提供 `ConfigMap`：\n\n  1. 通过对外部预定义 `ConfigMap` 的引用\n  1. 作为一组命名的配置文件，构建与图表关联的集成 `ConfigMap`\n\n这些可以通过图表值 `config.name` 和 `config.map` 分别进行设置。\n在这两种情况下，都可以将 `config.default` 设置为指向 `ConfigMap` 中的一个命名配置，\n从而为未通过节点标签自定义的节点提供默认配置（稍后会详细介绍）。\n\n##### 单个配置文件示例\n\n例如，在本地文件系统上创建一个有效的配置文件，如下所示：\n\n```shell\ncat \u003C\u003C EOF > \u002Ftmp\u002Fdp-example-config0.yaml\nversion: v1\nflags:\n  migStrategy: \"none\"\n  failOnInitError: true\n  nvidiaDriverRoot: \"\u002F\"\n  plugin:\n    passDeviceSpecs: false\n    deviceListStrategy: envvar\n    deviceIDStrategy: uuid\nEOF\n```\n\n然后通过 Helm 部署设备插件（指向此配置文件并为其命名）：\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set-file config.map.config=\u002Ftmp\u002Fdp-example-config0.yaml\n```\n\n在后台，这将部署与插件关联的 `ConfigMap`，并将 `dp-example-config0.yaml` 文件的内容放入其中，\n以 `config` 作为键。随后启动插件，使该配置在插件上线时生效。\n\n如果不想让插件的 Helm Chart 自动为您创建 `ConfigMap`，也可以将其指向预先创建的 `ConfigMap`，如下所示：\n\n```shell\nkubectl create ns nvidia-device-plugin\n```\n\n```shell\nkubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \\\n  --from-file=config=\u002Ftmp\u002Fdp-example-config0.yaml\n```\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set config.name=nvidia-plugin-configs\n```\n\n##### 多个配置文件示例\n\n对于多个配置文件，操作步骤类似。\n\n创建第二个配置文件，内容如下：\n\n```shell\ncat \u003C\u003C EOF > \u002Ftmp\u002Fdp-example-config1.yaml\nversion: v1\nflags:\n  migStrategy: \"mixed\" # 仅与 config0.yaml 不同\n  failOnInitError: true\n  nvidiaDriverRoot: \"\u002F\"\n  plugin:\n    passDeviceSpecs: false\n    deviceListStrategy: envvar\n    deviceIDStrategy: uuid\nEOF\n```\n\n然后再次通过 Helm 部署设备插件（同时指向两个配置文件，并指定默认配置）：\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set config.default=config0 \\\n  --set-file config.map.config0=\u002Ftmp\u002Fdp-example-config0.yaml \\\n  --set-file config.map.config1=\u002Ftmp\u002Fdp-example-config1.yaml\n```\n\n同样地，如果需要，也可以使用预先创建的 `ConfigMap` 来完成此操作：\n\n```shell\nkubectl create ns nvidia-device-plugin\n```\n\n```shell\nkubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \\\n  --from-file=config0=\u002Ftmp\u002Fdp-example-config0.yaml \\\n  --from-file=config1=\u002Ftmp\u002Fdp-example-config1.yaml\n```\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set config.default=config0 \\\n  --set config.name=nvidia-plugin-configs\n```\n\n**注意：** 如果未显式设置 `config.default` 标志，则会在其中一个配置名称设置为 `'default'` 时，\n从该配置中推断出默认值。如果两者均未设置，则部署将失败，除非仅提供 **_一个_** 配置文件。\n在这种情况下，由于没有其他选择，系统会自动将该配置选为默认配置。\n\n##### 使用节点标签更新每个节点的配置\n\n在此设置下，所有节点上的插件默认都会配置为使用 `config0`。然而，可以通过设置以下标签来更改应用的配置：\n\n```shell\nkubectl label nodes \u003Cnode-name> --overwrite \\\n  nvidia.com\u002Fdevice-plugin.config=\u003Cconfig-name>\n```\n\n例如，为所有安装了 T4 GPU 的节点应用自定义配置，可以执行以下命令：\n\n```shell\nkubectl label node \\\n  --overwrite \\\n  --selector=nvidia.com\u002Fgpu.product=TESLA-T4 \\\n  nvidia.com\u002Fdevice-plugin.config=t4-config\n```\n\n**注意：** 可以在插件启动之前或之后应用此标签，以使所需的配置在节点上生效。\n每当该标签的值发生变化时，插件会立即更新以使用新的配置。如果设置为未知值，则会跳过重新配置。\n如果该标签被取消设置，则会回退到默认配置。\n\n#### 设置其他 Helm Chart 值\n\n如前所述，设备插件的 Helm Chart 
仍然提供了直接用于设置插件配置选项的值，\n而无需使用 `ConfigMap`。这些值仅应用于设置全局适用的选项（这些选项不应再嵌入到由 `ConfigMap` 提供的配置文件集中），\n或者根据需要覆盖这些选项。\n\n这些值如下：\n\n```yaml\n  migStrategy:\n      在支持 MIG 的 GPU 上暴露 MIG 设备时所采用的策略\n      [none | single | mixed]（默认值为“none”）\n  failOnInitError:\n      如果初始化过程中遇到错误则使插件失败，否则无限期阻塞\n      （默认值为‘true’）\n  compatWithCPUManager:\n      以提升的权限运行，以兼容静态 CPUManager 策略\n      （默认值为‘false’）\n  deviceListStrategy:\n      向底层运行时传递设备列表时所采用的策略\n      [envvar | volume-mounts | cdi-annotations | cdi-cri]（默认值为“envvar”）\n  deviceIDStrategy:\n      向底层运行时传递设备 ID 时所采用的策略\n      [uuid | index]（默认值为“uuid”）\n  nvidiaDriverRoot:\n      NVIDIA 驱动程序安装的根路径（常见值为 '\u002F' 或 '\u002Frun\u002Fnvidia\u002Fdriver'）\n```\n\n**注意：** 没有直接映射到插件 `PASS_DEVICE_SPECS` 配置选项的值。取而代之的是提供了一个名为 `compatWithCPUManager` 的值，\n它既将插件的 `PASS_DEVICE_SPECS` 选项设置为真，又确保插件以提升的权限启动，\n以保证与 `CPUManager` 的良好兼容性。\n\n除了这些针对插件的自定义配置选项之外，通常还会覆盖的其他标准 Helm Chart 值包括：\n\n```yaml\nruntimeClassName:\n  要使用的 runtimeClassName，适用于具有多个运行时的集群。（典型值为‘nvidia’）\n```\n\n请查看\n[`values.yaml`](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fblob\u002Fv0.17.1\u002Fdeployments\u002Fhelm\u002Fnvidia-device-plugin\u002Fvalues.yaml)\n文件，以了解设备插件的所有可覆盖参数。\n\n设置这些选项的示例包括：\n\n启用与 `CPUManager` 的兼容性，并请求 100 毫秒的 CPU 时间以及 512MB 的内存限制。\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set compatWithCPUManager=true \\\n  --set resources.requests.cpu=100m \\\n  --set resources.limits.memory=512Mi\n```\n\n启用与 `CPUManager` 和 `mixed` `migStrategy` 的兼容性。\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set compatWithCPUManager=true \\\n  --set migStrategy=mixed\n```\n\n#### 使用 gpu-feature-discovery 自动添加节点标签进行部署\n\n自 `v0.12.0` 起，设备插件的 Helm Chart 已集成对\n[`gpu-feature-discovery`](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fgpu-feature-discovery)\n(GFD) 的支持。您可以使用 GFD 自动为节点上可用的 GPU 集合生成标签。其底层利用了 [Node Feature Discovery](https:\u002F\u002Fkubernetes-sigs.github.io\u002Fnode-feature-discovery\u002Fstable\u002Fget-started\u002Findex.html) 来完成这一标签化工作。\n\n要启用此功能，只需在 Helm 安装时设置 `gfd.enabled=true` 即可。\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version=0.17.1 \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  --set gfd.enabled=true\n```\n\n在后台，这也会部署 [`node-feature-discovery`](https:\u002F\u002Fgithub.com\u002Fkubernetes-sigs\u002Fnode-feature-discovery)\n(NFD)，因为它是 GFD 的先决条件。如果您集群中已经部署了 NFD，并且不希望此次安装将其引入，可以通过设置 `nfd.enabled=false` 来禁用它。\n\n除了 GFD 应用的标准节点标签外，在使用上述描述的分时或 MPS 扩展部署插件时，还会包含以下标签：\n\n```\nnvidia.com\u002F\u003Cresource-name>.replicas = \u003Cnum-replicas>\n```\n\n此外，如果 `renameByDefault=false`，`nvidia.com\u002F\u003Cresource-name>.product` 将按如下方式修改：\n\n```\nnvidia.com\u002F\u003Cresource-name>.product = \u003Cproduct name>-SHARED\n```\n\n借助这些标签，用户可以像选择不同型号的 GPU 一样，选择共享或非共享的 GPU。也就是说，`SHARED` 注解确保可以使用 `nodeSelector` 将 Pod 吸引到具有共享 GPU 的节点上。\n\n由于将 `renameByDefault=true` 已经在资源名称中编码了该资源是共享的事实，因此无需再通过 `SHARED` 注解来标记产品名称。用户只需在其 Pod 规范中请求所需的共享资源即可找到它们。\n\n注意：当使用 `renameByDefault=false` 并且 `migStrategy=single` 时，MIG 配置文件名和新的 `SHARED` 注解都会附加到产品名称上，例如：\n\n```\nnvidia.com\u002Fgpu.product = A100-SXM4-40GB-MIG-1g.5gb-SHARED\n```\n\n#### 以独立模式部署 gpu-feature-discovery\n\n自 v0.15.0 起，设备插件的 Helm Chart 已集成对\n[`gpu-feature-discovery`](\u002Fdocs\u002Fgpu-feature-discovery\u002FREADME.md#overview) 的支持。\n\n以独立模式部署 
gpu-feature-discovery 时，首先需要设置并更新插件的 Helm 仓库，具体步骤如下：\n\n```shell\nhelm repo add nvdp https:\u002F\u002Fnvidia.github.io\u002Fk8s-device-plugin\nhelm repo update\n```\n\n然后验证插件的最新版本（v0.17.1）是否可用（请注意，这包括 GFD Chart）：\n\n```shell\nhelm search repo nvdp --devel\nNAME                     \t  CHART VERSION  APP VERSION\tDESCRIPTION\nnvdp\u002Fnvidia-device-plugin\t  0.17.1\t 0.17.1\t\tA Helm chart for ...\n```\n\n一旦该仓库更新完毕，您就可以开始从其中安装软件包，以独立模式部署 `gpu-feature-discovery` 组件。\n\n最基本的无选项安装命令如下：\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n  --version 0.17.1 \\\n  --namespace gpu-feature-discovery \\\n  --create-namespace \\\n  --set devicePlugin.enabled=false\n```\n\n禁用 NFD 的自动部署，并在默认命名空间中使用混合 MIG 策略运行。\n\n```shell\nhelm upgrade -i nvdp nvdp\u002Fnvidia-device-plugin \\\n    --version=0.17.1 \\\n    --set allowDefaultNamespace=true \\\n    --set nfd.enabled=false \\\n    --set migStrategy=mixed \\\n    --set devicePlugin.enabled=false\n```\n\n**注意：** 只有在使用预发布版本（例如 `\u003Cversion>-rc.1`）时，才需要在 `helm search repo` 中添加 `--devel` 标志，以及在 `helm upgrade -i` 中添加 `--version` 标志。正式发布的版本则无需这些标志。\n\n\n\n### 通过 Helm install 直接使用 Helm 包的 URL 进行部署\n\n如果您不想从 `nvidia-device-plugin` Helm 仓库安装，可以直接针对插件 Helm 包的 tarball 运行 `helm install`。下面的示例安装的是与上述方法相同的 Chart，只是使用 Helm Chart 的直接 URL 而不是通过 Helm 仓库。\n\n使用默认值运行：\n\n```shell\nhelm upgrade -i nvdp \\\n  --namespace nvidia-device-plugin \\\n  --create-namespace \\\n  https:\u002F\u002Fnvidia.github.io\u002Fk8s-device-plugin\u002Fstable\u002Fnvidia-device-plugin-0.17.1.tgz\n```\n\n## 在本地构建和运行\n\n接下来的部分重点介绍如何在本地构建并运行设备插件。这仅适用于开发和测试，大多数用户并不需要。本节假设您正在使用最新的发布标签（即 `v0.17.1`），但也可以轻松修改以适应任何可用的标签或分支。\n\n### 使用 Docker\n\n#### 构建\n\n选项 1：从 [Docker Hub](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fnvidia\u002Fk8s-device-plugin) 拉取预构建的镜像：\n\n```shell\ndocker pull nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:v0.17.1\ndocker tag nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:v0.17.1 nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:devel\n```\n\n选项 2：无需克隆仓库直接构建：\n\n```shell\ndocker build \\\n  -t nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:devel \\\n  -f deployments\u002Fcontainer\u002FDockerfile.ubuntu \\\n  https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin.git#v0.17.1\n```\n\n选项 3：如果您想修改代码：\n\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin.git && cd k8s-device-plugin\ndocker build \\\n  -t nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:devel \\\n  -f deployments\u002Fcontainer\u002FDockerfile.ubuntu \\\n  .\n```\n\n#### 运行\n\n不兼容 `CPUManager` 静态策略时：\n\n```shell\ndocker run \\\n  -it \\\n  --security-opt=no-new-privileges \\\n  --cap-drop=ALL \\\n  --network=none \\\n  -v \u002Fvar\u002Flib\u002Fkubelet\u002Fdevice-plugins:\u002Fvar\u002Flib\u002Fkubelet\u002Fdevice-plugins \\\n  nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:devel\n```\n\n兼容 `CPUManager` 静态策略时：\n\n```shell\ndocker run \\\n  -it \\\n  --privileged \\\n  --network=none \\\n  -v \u002Fvar\u002Flib\u002Fkubelet\u002Fdevice-plugins:\u002Fvar\u002Flib\u002Fkubelet\u002Fdevice-plugins \\\n  nvcr.io\u002Fnvidia\u002Fk8s-device-plugin:devel --pass-device-specs\n```\n\n### 不使用 Docker\n\n#### 构建\n\n```shell\nC_INCLUDE_PATH=\u002Fusr\u002Flocal\u002Fcuda\u002Finclude LIBRARY_PATH=\u002Fusr\u002Flocal\u002Fcuda\u002Flib64 go build\n```\n\n#### 运行\n\n不支持 `CPUManager` 静态策略时：\n\n```shell\n.\u002Fk8s-device-plugin\n```\n\n支持 `CPUManager` 静态策略时：\n\n```shell\n.\u002Fk8s-device-plugin --pass-device-specs\n```\n\n## 更改日志\n\n请参阅 [更改日志](CHANGELOG.md)\n\n## 
问题与贡献\n\n[查看贡献文档！](CONTRIBUTING.md)\n\n- 您可以通过 [提交新问题](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fissues\u002Fnew) 来报告 bug。\n- 您也可以通过打开一个 [拉取请求](https:\u002F\u002Fhelp.github.com\u002Farticles\u002Fusing-pull-requests\u002F) 来做出贡献。\n\n### 版本管理\n\n在 v1.10 之前，设备插件的版本号必须与 Kubernetes 的版本号完全一致。随着设备插件被提升为 Beta 状态，这一要求不再适用。然而，我们很快发现这种版本管理方式让用户感到非常困惑，因为他们仍然期望看到每个 Kubernetes 版本对应的设备插件版本。\n\n这种版本管理方式适用于标签 `v1.8`、`v1.9`、`v1.10`、`v1.11`、`v1.12`。\n\n现在，我们已将版本管理改为遵循 [语义化版本控制](https:\u002F\u002Fsemver.org\u002F)。首个按照此方案标记的版本为 `v0.0.0`。\n\n今后，设备插件的主版本号仅会在设备插件 API 发生变化时才会更新。例如，设备插件 API 的 `v1beta1` 版本对应于设备插件的 `v0.x.x` 版本。如果设备插件 API 推出新的 `v2beta2` 版本，那么设备插件的主版本号将升级到 `1.x.x`。\n\n截至目前，Kubernetes >= v1.10 的设备插件 API 是 `v1beta1`。如果您使用的 Kubernetes 版本大于等于 1.10，您可以部署任何大于 `v0.0.0` 的设备插件版本。\n\n### 在使用设备插件的情况下升级 Kubernetes\n\n在部署了设备插件的情况下升级 Kubernetes 并不需要对您的工作流程进行任何特殊调整。API 已经进行了版本管理，并且相当稳定（尽管不能保证完全向后兼容）。从 Kubernetes 1.10 开始，您可以使用设备插件的 `v0.3.0` 版本来执行升级，而无需部署不同版本的设备插件。节点在升级完成后重新上线时，GPU 将会自动重新注册。\n\n然而，升级设备插件本身则是一项更为复杂的任务。建议您先驱逐 GPU 任务，因为我们无法保证 GPU 任务能够在滚动升级过程中正常运行。不过，我们会尽最大努力在升级过程中保留 GPU 任务。","# NVIDIA k8s-device-plugin 快速上手指南\n\n本指南帮助您在 Kubernetes 集群中快速部署 NVIDIA GPU 支持，使容器能够请求和使用 GPU 资源。\n\n## 环境准备\n\n在开始之前，请确保您的集群节点满足以下前置条件：\n\n*   **操作系统**: Linux (推荐 Debian\u002FUbuntu 或 CentOS\u002FRHEL)\n*   **NVIDIA 驱动**: 版本 ~= 384.81 或更高\n*   **容器运行时工具**:\n    *   `nvidia-container-toolkit` >= 1.7.0\n    *   若需在 Tegra 系统上使用集成 GPU，需 >= 1.11.0\n*   **Kubernetes**: 版本 >= 1.10\n*   **运行时配置**: 必须将 `nvidia-container-runtime` 配置为默认的低层级运行时。\n\n> **注意**：以下步骤需要在集群中**所有**包含 GPU 的节点上执行。\n\n### 安装 NVIDIA Container Toolkit\n\n以 Debian\u002FUbuntu 系统为例，安装并配置 toolkit：\n\n1.  **添加仓库密钥和源**：\n    ```bash\n    curl -fsSL https:\u002F\u002Fnvidia.github.io\u002Flibnvidia-container\u002Fgpgkey | sudo gpg --dearmor -o \u002Fusr\u002Fshare\u002Fkeyrings\u002Fnvidia-container-toolkit.gpg\n    curl -s -L https:\u002F\u002Fnvidia.github.io\u002Flibnvidia-container\u002Fstable\u002Fdeb\u002Fnvidia-container-toolkit.list | \\\n      sed 's#deb https:\u002F\u002F#deb [signed-by=\u002Fusr\u002Fshare\u002Fkeyrings\u002Fnvidia-container-toolkit.gpg] https:\u002F\u002F#g' | \\\n      sudo tee \u002Fetc\u002Fapt\u002Fsources.list.d\u002Fnvidia-container-toolkit.list\n    ```\n\n2.  **安装工具包**：\n    ```bash\n    sudo apt-get update\n    sudo apt-get install -y nvidia-container-toolkit\n    ```\n\n3.  
**配置容器运行时**：\n    根据您使用的运行时（docker, containerd 或 CRI-O）运行配置命令。\n    \n    *针对 `containerd` (推荐):*\n    ```bash\n    sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default\n    sudo systemctl restart containerd\n    ```\n\n    *针对 `CRI-O`:*\n    ```bash\n    sudo nvidia-ctk runtime configure --runtime=crio --set-as-default\n    sudo systemctl restart crio\n    ```\n    *(注：若使用 CRI-O，还需确保 `\u002Fetc\u002Fnvidia-container-runtime\u002Fconfig.toml` 中包含 `runtimes = [\"crun\", \"runc\"]`)*\n\n    *针对 `docker` (已弃用但仍可用):*\n    ```bash\n    sudo nvidia-ctk runtime configure --runtime=docker --set-as-default\n    sudo systemctl restart docker\n    ```\n\n## 安装步骤\n\n确认所有 GPU 节点已完成上述配置后，即可在 Kubernetes 中部署 Device Plugin。\n\n### 方式一：使用静态 YAML 部署（快速测试）\n\n这是最简单的部署方式，适用于快速验证功能：\n\n```bash\nkubectl create -f https:\u002F\u002Fraw.githubusercontent.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fv0.17.1\u002Fdeployments\u002Fstatic\u002Fnvidia-device-plugin.yml\n```\n\n> **提示**：生产环境建议使用 Helm 进行部署以便更好地管理配置，详见官方文档。\n\n### 方式二：验证部署\n\n部署完成后，插件会以 DaemonSet 形式运行。检查状态：\n\n```bash\nkubectl get pods -n kube-system -l app=nvidia-device-plugin\n```\n\n确保所有 Pod 状态为 `Running`。\n\n## 基本使用\n\n部署成功后，您可以在 Pod 定义中通过 `nvidia.com\u002Fgpu` 资源类型来请求 GPU。\n\n### 运行一个简单的 GPU 任务\n\n创建一个名为 `gpu-pod.yaml` 的文件或直接运行以下命令：\n\n```yaml\ncat \u003C\u003CEOF | kubectl apply -f -\napiVersion: v1\nkind: Pod\nmetadata:\n  name: gpu-pod\nspec:\n  restartPolicy: Never\n  containers:\n    - name: cuda-container\n      image: nvcr.io\u002Fnvidia\u002Fk8s\u002Fcuda-sample:vectoradd-cuda12.5.0\n      resources:\n        limits:\n          nvidia.com\u002Fgpu: 1 # 请求 1 块 GPU\n  tolerations:\n  - key: nvidia.com\u002Fgpu\n    operator: Exists\n    effect: NoSchedule\nEOF\n```\n\n### 查看运行结果\n\n等待 Pod 运行完成并查看日志：\n\n```bash\nkubectl logs gpu-pod\n```\n\n如果配置正确，您将看到类似以下的输出，表明 CUDA 程序已成功在 GPU 上运行：\n\n```text\n[Vector addition of 50000 elements]\nCopy input data from the host memory to the CUDA device\nCUDA kernel launch with 196 blocks of 256 threads\nCopy output data from the CUDA device to the host memory\nTest PASSED\nDone\n```\n\n> **警告**：如果在 Pod 中没有设置 `resources.limits.nvidia.com\u002Fgpu`，Device Plugin 默认会将节点上的**所有** GPU 暴露给该容器。务必明确指定所需的 GPU 数量。","某大型金融科技公司正在构建基于 Kubernetes 的实时反欺诈系统，需要在集群中调度数百个容器来并行运行依赖 NVIDIA GPU 的深度推理模型。\n\n### 没有 k8s-device-plugin 时\n- **资源不可见**：Kubernetes 调度器完全“看不见”节点上的物理 GPU，无法感知哪些机器具备加速能力，导致任务只能随机分配或需人工硬编码指定节点。\n- **手动配置繁琐**：运维人员必须在每个 Pod 的 YAML 文件中手动挂载复杂的宿主机的驱动目录和设备文件（如 `\u002Fdev\u002Fnvidia0`），极易因路径错误导致容器启动失败。\n- **资源争抢风险**：缺乏细粒度的资源计数机制，多个任务可能被调度到同一张显卡上，造成显存溢出或计算冲突，且无法自动隔离。\n- **故障恢复困难**：当某个 GPU 发生硬件故障时，集群无法自动感知健康状态，故障节点上的任务会一直挂起，需人工介入排查并重新调度。\n\n### 使用 k8s-device-plugin 后\n- **自动资源暴露**：k8s-device-plugin 以 DaemonSet 形式自动发现并向 Kubernetes 注册所有节点的 GPU 数量，调度器可直接通过 `nvidia.com\u002Fgpu` 资源请求进行智能匹配。\n- **声明式简化**：开发人员只需在 Pod 规格中简单声明需要的 GPU 数量（如 `resources: limits: nvidia.com\u002Fgpu: 1`），无需关心底层驱动挂载细节，容器即可自动获得 GPU 环境。\n- **精确资源隔离**：工具实现了严格的资源计数与隔离，确保每个申请了 GPU 的 Pod 独占或按策略共享指定的显卡资源，彻底杜绝了隐性争抢。\n- **健康监控闭环**：k8s-device-plugin 持续监控 GPU 健康状态，一旦检测到设备异常，会自动停止向该节点调度新任务，配合 Kubernetes 机制快速迁移受影响的工作负载。\n\nk8s-device-plugin 将原本复杂脆弱的 GPU 集群管理转化为标准化的云原生资源调度，让 AI 业务能够像使用 CPU 一样弹性、稳定地调用算力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA_k8s-device-plugin_7af42788.png","NVIDIA","NVIDIA 
Corporation","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FNVIDIA_7dcf6000.png","",null,"https:\u002F\u002Fnvidia.com","https:\u002F\u002Fgithub.com\u002FNVIDIA",[82,86,89,93,97],{"name":83,"color":84,"percentage":85},"Go","#00ADD8",92.5,{"name":87,"color":88,"percentage":10},"Shell","#89e051",{"name":90,"color":91,"percentage":92},"Makefile","#427819",2.6,{"name":94,"color":95,"percentage":96},"Mustache","#724b3b",1.5,{"name":98,"color":99,"percentage":100},"Dockerfile","#384d54",0.4,3723,805,"2026-04-17T15:07:52","Apache-2.0",4,"Linux","必需 NVIDIA GPU (驱动版本 >= 384.81)，支持基于 Tegra 的集成 GPU (需 toolkit >= 1.11.0)，支持 MIG 多实例 GPU","未说明",{"notes":110,"python":108,"dependencies":111},"该工具是 Kubernetes 的 DaemonSet，需在所有 GPU 节点上预装 NVIDIA 驱动和 Container Toolkit。若使用 CRI-O，需额外配置将 nvidia-container-runtime 设为默认运行时并添加 crun 到允许列表。若未正确请求 GPU 资源，容器可能会暴露机器上的所有 GPU。生产环境建议使用 Helm 部署而非静态 YAML。",[112,113,114,115,116],"Kubernetes >= 1.10","NVIDIA Drivers ~= 384.81","nvidia-docker >= 2.0 或 nvidia-container-toolkit >= 1.7.0","nvidia-container-runtime (需配置为默认低级运行时)","Docker \u002F containerd \u002F CRI-O",[14,45],[119],"kubernetes","2026-03-27T02:49:30.150509","2026-04-18T09:19:18.086252",[123,128,133,138,143,148],{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},38859,"如何在 Kubernetes 中实现 GPU 共享（即多个容器\u002F进程共用同一块物理 GPU）？","官方插件目前主要通过时间切片（Time Slicing）支持共享。此外，社区提供了一个名为 `nvshare` 的透明 GPU 共享机制，它基于独立研究，允许在不限制内存大小的情况下并发运行多个进程\u002F容器。使用 `nvshare` 时，每个容器都能访问完整的物理显存，且拥有独立的 CUDA 上下文和页表，从而保证内存隔离和安全性。你可以查看该项目：https:\u002F\u002Fgithub.com\u002Fgrgalex\u002Fnvshare","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fissues\u002F169",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},38860,"为什么请求 0 个 GPU (nvidia.com\u002Fgpu: 0) 时，容器仍然获得了所有 GPU 权限？如何禁止这种行为？","默认情况下，如果不请求 GPU 或未正确配置，NVIDIA 镜像可能会暴露所有 GPU。要解决此问题并允许通过环境变量控制可见设备，需要在 GPU Operator 的 Helm values 中配置 toolkit 和 devicePlugin 的环境变量。具体配置如下：\n\ntoolkit:\n  env:\n    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED\n      value: \"false\"\n    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS\n      value: \"true\"\ndevicePlugin:\n  env:\n    - name: DEVICE_LIST_STRATEGY\n      value: volume-mounts\n\n应用后，请检查 `\u002Fusr\u002Flocal\u002Fnvidia\u002Ftoolkit\u002F.config\u002Fnvidia-container-runtime\u002Fconfig.toml` 确认标志值已更新。这样设置后，在 Pod 中设置 `NVIDIA_VISIBLE_DEVICES=none` 即可阻止 GPU 被挂载。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fissues\u002F61",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},38861,"在使用 containerd 作为运行时，nvidia-device-plugin 容器出现 CrashLoopBackOff 错误，日志提示 'Detected non-NVML platform' 或 'Incompatible platform detected'，如何解决？","这通常是因为 containerd 未将 nvidia 设置为默认运行时。对于 k3s 环境，解决方法是修改 containerd 配置文件：\n\n1. 备份配置模板：\nsudo cp \u002Fvar\u002Flib\u002Francher\u002Fk3s\u002Fagent\u002Fetc\u002Fcontainerd\u002Fconfig.toml \u002Fvar\u002Flib\u002Francher\u002Fk3s\u002Fagent\u002Fetc\u002Fcontainerd\u002Fconfig.toml.tmpl\n\n2. 编辑 `\u002Fvar\u002Flib\u002Francher\u002Fk3s\u002Fagent\u002Fetc\u002Fcontainerd\u002Fconfig.toml.tmpl`，在 `[plugins.\"io.containerd.grpc.v1.cri\".containerd]` 部分添加：\ndefault_runtime_name = \"nvidia\"\n\n3. 重启 k3s 服务：\nsudo systemctl restart k3s\n\n4. 
验证修复：\nsudo kubectl logs \u003Cgpu-feature-discovery-pod> | grep NVML\n应看到 'Detected NVML platform: found NVML library'。\n同时检查节点标签：\nsudo kubectl describe nodes | grep nvidia.com\u002Fgpu.count","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fissues\u002F406",{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},38862,"日志中出现 'error getting GPU device minor number: Not Supported' 导致插件启动失败，这是什么原因？","该错误通常发生在插件无法正确识别 GPU 设备次号码（minor number）时，常见于驱动未正确安装、NVML 库缺失或容器内缺乏访问 `\u002Fdev\u002Fnvidia*` 设备的权限。首先确保宿主机已正确安装 NVIDIA 驱动并能正常运行 `nvidia-smi`。其次，确认已按照前置要求配置了 NVIDIA Container Toolkit。如果是在非标准环境（如某些精简版 OS 或特定云实例），可能需要检查内核模块是否加载以及设备文件是否存在。若问题依旧，建议升级到最新版本的 plugin 并检查是否启用了 MIG 策略配置冲突。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fissues\u002F332",{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},38863,"如何在 Kubernetes 集群中正确启用和使用 MPS (Multi-Process Service) 来优化多容器间的 GPU 利用率？","要在 Kubernetes 中使用 MPS，需在宿主机上启动 `nvidia-cuda-mps-control` 守护进程，并在创建 Pod 时启用 `hostIPC` 和 `hostPID` 以允许容器间通信。然而，仅靠这些设置可能不足以让内存限制生效。目前官方插件对 MPS 的原生支持有限，通常需要手动配置容器内的环境变量如 `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT`。需要注意的是，MPS 的内存隔离不如其他方案严格，且配置较为复杂。如果遇到内存限制不生效的问题，建议检查 MPS 控制进程是否在容器内可见，并确认驱动版本（如 495.44+）和 CUDA 版本（如 11.5+）的兼容性。社区也在探讨更灵活的共享机制（如 nvshare）作为替代方案。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fissues\u002F467",{"id":149,"question_zh":150,"answer_zh":151,"source_url":132},38864,"在哪里可以找到关于 NVIDIA_VISIBLE_DEVICES 环境变量及其用法的详细文档？","关于 `NVIDIA_VISIBLE_DEVICES` 等环境变量的详细使用说明，可以参考 Google Docs 上的官方说明文档：https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1zy0key-EL6JH50MZgwg96RPYxxXXnVUdxLZwGiyqLd8\u002Fedit。维护者表示会将这些指令尽快整合到官方文档中。该文档解释了如何通过设置该环境变量为 'none' 来隐藏 GPU，或指定特定 GPU ID 来控制容器可见的设备列表。",[153,158,163,168,173,178,183,188,193,198,203,208,213,218,223,228,233,238,243,248],{"id":154,"version":155,"summary_zh":156,"released_at":157},314793,"v0.19.0","## 更改日志\n\n- 为 GFD 添加 --sleep-interval=infinite 支持，以便作为 Pod 运行 (#1603)\n- 修复静态部署中的镜像标签 (#1604)\n- 为 NodeFeature CR 添加 ownerReference，以支持垃圾回收 (#1597)\n- 更改 gds、gdrcopy 和 mofed 标志的默认值 (#1550)\n- 修复旧设备上的健康检查 (#1562)\n- 在 GFD 中默认启用 NodeFeature API (#1504)\n- 在原生 GitHub Runner 上构建多架构镜像 (#1468)","2026-03-17T18:24:35",{"id":159,"version":160,"summary_zh":161,"released_at":162},314794,"v0.18.2","## 变更内容\n- 确保将 cdi.FeatureFlags 传递给 CDI 库\n- 修复配置管理器中标签未设置时的竞态条件\n- 通过确保 IPC 套接字不会以只读方式挂载，修复嵌套容器的使用场景\n- 将 NVIDIA 容器工具包升级至 v1.18.2\n- 将 distroless 基础镜像升级至 v3.2.2-dev\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.18.1...v0.18.2","2026-01-23T14:24:41",{"id":164,"version":165,"summary_zh":166,"released_at":167},314795,"v0.18.1","## 更改日志\n\n- 允许设置 CDI 功能标志\n- 在设备插件主函数中将驱动程序根目录传递给 nvinfo.New\n- 将 NVIDIA 容器工具包升级至 v1.18.1\n- 将 distroless 基础镜像升级至 v3.2.1-dev\n- 将 github.com\u002Fopencontainers\u002Fselinux 从 1.12.0 升级至 1.13.1 (#1506)\n\n\n**完整更改日志**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.18.0...v0.18.1","2026-01-07T22:10:05",{"id":169,"version":170,"summary_zh":171,"released_at":172},314796,"v0.18.0","## 更改日志\n\n- 重命名 getHealthCheckXids 方法并澄清文档\n- 添加对在健康检查中显式启用 XID 的支持\n- 去除请求的设备 ID 重复项\n- 在读取布尔型配置值之前先检查是否为 nil\n- 将 CDI 中的门控模式（GDS、MOFED、GDRCOPY）设置为可选\n- 添加设置 gdrcopyEnabled 的支持\n- 忽略使用 NVML 获取设备内存时的错误\n- 确保目录卷的类型为 Directory\n- 构建时切换到纯 Go 镜像\n- 移除不必要的中间容器\n- 更新 CI 定义\n- 切换到 distroless Go 镜像\n- 在 README.md 中添加 RuntimeClass 相关内容\n- 在整个 device-plugin 方法调用栈中传递单个上下文 (#1284)\n- 移除内部日志记录器，改用 klog (#1277)\n- 从静态示例中移除 FAIL_ON_INIT_ERROR\n- 检测 
Blackwell 架构\n- 更新 .release:staging，将 device-plugin 镜像暂存至 nvstaging\n- 使用 MiB 而不是 MB 表示 GPU 内存\n- 忽略 XID 错误 109\n- 更新 README.md，调整 Docker 运行时默认设置\n- 移除 nvidia.com\u002Fgpu.imex-domain 标签\n- 修复创建 kind 集群时 containerd runc 配置错误的问题\n- 创建 kind 集群时使用稳定的 NVIDIA Container Toolkit 仓库\n- 切换到 Go 标准库中的 context 包\n- 如果 GPU 模式标签器失败，发出警告而非错误\n- 为计算能力 8.9 添加 Ada Lovelace 架构标签\n- 确保 FAIL_ON_INIT_ERROR 布尔环境变量被加引号\n- 当未找到资源时，仍尊重 fail-on-init-error 设置\n- 在 mps-control-daemon Pod 中启用 hostPID (#1045)\n\n**完整更改日志**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.17.1...v0.18.0","2025-10-21T13:53:30",{"id":174,"version":175,"summary_zh":176,"released_at":177},314797,"v0.17.4","## 变更内容\n* 由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1317 中将 slackapi\u002Fslack-github-action 从 2.1.0 升级至 2.1.1\n* 由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1346 中将 github.com\u002FNVIDIA\u002Fgo-nvlib 从 0.7.2 升级至 0.7.4\n* 由 @dependabot[bot] 在 \u002Fdeployments\u002Fdevel 中将 golang 从 1.23.11 升级至 1.23.12，相关 PR 为 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1355\n* 由 @elezar 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1368 中确保目录卷的类型为 Directory\n* 由 @dependabot[bot] 在 \u002Fdeployments\u002Fcontainer 中将 nvidia\u002Fcuda 从 12.9.1-base-ubi9 升级至 13.0.0-base-ubi9，相关 PR 为 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1369\n* 由 @elezar 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1374 中忽略使用 NVML 获取设备内存时的错误\n* 由 @cdesiniotis 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1402 中将项目版本升级至 v0.17.4\n* [无发布说明] 由 @cdesiniotis 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1406 中更新发布流水线的 NGC 发布逻辑\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.17.3...v0.17.4","2025-09-09T18:53:23",{"id":179,"version":180,"summary_zh":181,"released_at":182},314798,"v0.17.3","## 变更内容\n* 由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1275 中将 github.com\u002FNVIDIA\u002Fnvidia-container-toolkit 从 1.17.6 升级至 1.17.8\n* 由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1300 中将 nvidia\u002Fcuda 从 12.9.0-base-ubi9 升级至 12.9.1-base-ubi9，位于 \u002Fdeployments\u002Fcontainer 目录下\n* 由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1287 中将 github.com\u002FNVIDIA\u002Fgo-nvml 从 0.12.4-1 升级至 0.12.9-0\n* 由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1283 中将 golang 从 1.23.9 升级至 1.23.10，位于 \u002Fdeployments\u002Fdevel 目录下\n* 由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1318 中将 golang 从 1.23.10 升级至 1.23.11，位于 \u002Fdeployments\u002Fdevel 目录下\n* 由 @elezar 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1326 中发布 v0.17.3 版本\n* 后向移植：由 @cdesiniotis 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1328 中将 golang.org\u002Fx\u002Foauth2 从 0.23.0 升级至 0.27.0\n* 更新了 .release:staging 配置，以在 nvstaging 中暂存设备插件镜像，由 @elezar 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1329 中完成\n\n**完整变更日志**: 
https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.17.2...v0.17.3","2025-07-24T09:53:38",{"id":184,"version":185,"summary_zh":186,"released_at":187},314799,"v0.17.2","## 变更内容\n- 更新 nvidia.com\u002Fgpu.product 标签，以包含 Blackwell 架构\n- 更新文档，说明 nvidia.com\u002Fgpu.memory 标签的单位为 MiB，而非 MB\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.17.1...v0.17.2","2025-05-13T18:21:16",{"id":189,"version":190,"summary_zh":191,"released_at":192},314800,"v0.17.1","## 变更内容\n* 在 \u002Fdeployments\u002Fdevel 中，由 @dependabot 将 Go 语言版本从 1.23.2 升级至 1.23.3，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1063\n* 在一个目录内，由 @dependabot 对 k8sio 组织下的多个依赖进行升级，共包含 5 次更新，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1066\n* 由 @elezar 确保 FAIL_ON_INIT_ERROR 布尔型环境变量被加引号，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1076\n* 在 \u002Fdeployments\u002Fcontainer 中，由 @dependabot 将 nvidia\u002Fcuda 镜像从 12.6.2-base-ubi9 升级至 12.6.3-base-ubi9，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1084\n* 由 @dependabot 将 github.com\u002FNVIDIA\u002Fnvidia-container-toolkit 从 1.17.0 升级至 1.17.2，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1068\n* 由 @dependabot 将 google.golang.org\u002Fgrpc 从 1.65.0 升级至 1.65.1，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1069\n* 由 @dependabot 将 sigs.k8s.io\u002Fnode-feature-discovery 从 0.15.4 升级至 0.15.7，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1070\n* 由 @dependabot 将 NVIDIA\u002Fholodeck 从 0.2.3 升级至 0.2.4，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1064\n* 由 @elezar 实现在未找到资源时尊重 fail-on-init-error 配置，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1061\n* 由 @dependabot 将 github.com\u002Fopencontainers\u002Fselinux 从 1.11.0 升级至 1.11.1，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1067\n* 由 @elezar 添加针对计算能力为 8.9 的 ada-lovelace 架构标签，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1090\n* 由 @elezar 切换到 Go 标准库中的 context 包，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1114\n* 由 @dependabot 将 github.com\u002FNVIDIA\u002Fnvidia-container-toolkit 从 1.17.2 升级至 1.17.4，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1138\n* 在 \u002Fdeployments\u002Fcontainer 中，由 @dependabot 将 nvidia\u002Fcuda 镜像从 12.6.3-base-ubi9 升级至 12.8.0-base-ubi9，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1142\n* 由 @dependabot 将 NVIDIA\u002Fholodeck 从 0.2.4 升级至 0.2.5，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1131\n* 由 @dependabot 将 slackapi\u002Fslack-github-action 从 1.27.0 升级至 2.0.0，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1065\n* 由 @dependabot 将 github.com\u002FNVIDIA\u002Fgo-nvlib 从 0.7.0 升级至 0.7.1，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1151\n* 由 @elezar 忽略 XID 错误 109，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1171\n* 由 @elezar 移除 nvidia.com\u002Fgpu.imex-domain 标签，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1152\n* 由 @dependabot 将 azure\u002Fsetup-helm 从 4.2.0 升级至 4.3.0，详见 
https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1176\n* 由 @elezar 将 github.com\u002FNVIDIA\u002Fnvidia-container-toolkit 从 1.17.4 升级至 1.17.5-rc.1，详见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fpull\u002F1192\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.17.0...v0.17.1","2025-03-12T09:59:59",{"id":194,"version":195,"summary_zh":196,"released_at":197},314801,"v0.17.0","## 变更内容\n### v0.17.0\n- 将 v0.17.0-rc.1 正式发布（GA）\n\n### v0.17.0-rc.1\n- 如果包含卷挂载列表策略，则添加 CAP_SYS_ADMIN 权限\n- 移除不必要的 DEVICE_PLUGIN_MODE 环境变量\n- 修复 MPS 的 SELinux 标签应用问题\n- 使用与 ubi-minimal 基础镜像一致的基础镜像\n- 切换到基于 ubi9 的基础镜像\n- 从集群作用域资源中移除 namespace 字段\n- 为 IMEX 集群和域名生成标签\n- 添加可选的默认 IMEX 通道注入功能\n- 允许将 kubelet 套接字作为命令行参数指定\n","2024-10-31T15:36:58",{"id":199,"version":200,"summary_zh":201,"released_at":202},314802,"v0.17.0-rc.1","## 变更内容\n\n- 如果包含卷挂载列表策略，则添加 CAP_SYS_ADMIN 权限\n- 移除不必要的 DEVICE_PLUGIN_MODE 环境变量\n- 修复 MPS 的 SELinux 标签应用问题\n- 使用与 UBI Minimal 基础镜像一致的基础镜像\n- 切换到基于 UBI 9 的基础镜像\n- 从集群作用域的资源中移除命名空间字段\n- 为 IMEX 集群和域生成标签\n- 添加可选的默认 IMEX 通道注入功能\n- 允许将 kubelet 套接字作为命令行参数指定","2024-10-31T15:40:27",{"id":204,"version":205,"summary_zh":206,"released_at":207},314803,"v0.16.2","## What's Changed\r\n* Fix applying SELinux label for MPS\r\n* Remove unneeded DEVICE_PLUGIN_MODE envvar\r\n* Add CAP_SYS_ADMIN if volume-mounts list strategy is included (fixes #856)\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.16.1...v0.16.2","2024-08-08T11:02:06",{"id":209,"version":210,"summary_zh":211,"released_at":212},314804,"v0.16.1","## Changelog\r\n\r\n## What's Changed\r\n* Bump nvidia-container-toolkit to v1.16.1 to fix a bug with CDI spec generation for MIG devices\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.16.0...v0.16.1","2024-07-26T18:37:00",{"id":214,"version":215,"summary_zh":216,"released_at":217},314805,"v0.16.0","## Changelog\r\n\r\n### v0.16.0\r\n- Fixed logic of atomic writing of the feature file\r\n- Replaced `WithDialer` with `WithContextDialer`\r\n- Fixed SELinux context of MPS pipe directory.\r\n- Changed behavior for empty MIG devices to issue a warning instead of an error when the mixed strategy is selected\r\n- Added a a GFD node label for the GPU mode.\r\n- Update CUDA base image version to 12.5.1\r\n\r\n### v0.16.0-rc.1\r\n- Skip container updates if only CDI is selected\r\n- Allow cdi hook path to be set\r\n- Add nvidiaDevRoot config option\r\n- Detect devRoot for driver installation\r\n- Changed the automatically created MPS \u002Fdev\u002Fshm to half of the total memory as obtained from \u002Fproc\u002Fmeminfo\r\n- Remove redundant version log\r\n- Remove provenance information from image manifests\r\n- add ngc image signing job for auto signing\r\n- fix: target should be binaries\r\n- Allow device discovery strategy to be specified\r\n- Refactor cdi handler construction\r\n- Add addMigMonitorDevices field to nvidia-device-plugin.options helper\r\n- Fix allPossibleMigStrategiesAreNone helm chart helper\r\n- use the helm quote function to wrap boolean values in quotes\r\n- Fix usage of hasConfigMap\r\n- Make info, nvml, and device lib construction explicit\r\n- Clean up construction of WSL devices\r\n- Remove unused function\r\n- Don't require node-name to be set if not needed\r\n- Make vgpu failures non-fatal\r\n- Use HasTegraFiles over IsTegraSystem\r\n- Raise error for MPS when using 
MIG\r\n- Align container driver root envvars\r\n- Update github.com\u002FNVIDIA\u002Fgo-nvml to v0.12.0-6\r\n- Add unit tests cases for sanitise func\r\n- Improving logic to sanitize GFD generated node labels\r\n- Add newline to pod logs\r\n- Adding vfio manager\r\n- Add prepare-release.sh script\r\n- Don't require node-name to be set if not needed\r\n- Remove GitLab pipeline .gitlab.yml\r\n- E2E test: fix object names\r\n- strip parentheses from the gpu product name\r\n- E2E test: instanciate a logger for helm outputs\r\n- E2E test: enhance logging via ginkgo\u002Fgomega\r\n- E2E test: remove e2elogs helper pkg\r\n- E2E test: Create HelmClient during Framework init\r\n- E2E test: Add -ginkgo.v flag to increase verbosity\r\n- E2E test: Create DiagnosticsCollector\r\n- Update vendoring\r\n- Replace go-nvlib\u002Fpkg\u002Fnvml with go-nvml\u002Fpkg\u002Fnvml\r\n- Add dependabot updates for release-0.15\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.15.0...v0.16.0","2024-07-16T13:29:25",{"id":219,"version":220,"summary_zh":221,"released_at":222},314806,"v0.15.1","## Changelog\r\n\r\n- Fix inconsistent usage of `hasConfigMap` helm template. This addresses cases where certain resources (roles and service accounts) would be created even if they were not required.\r\n- Raise an error in GFD when MPS is used with MIG. This ensures that the behavior across GFD and the Device Plugin is consistent.\r\n- Remove provenance information from published images.\r\n- Use half of total memory for size of MPS tmpfs by default.\r\n","2024-06-25T12:07:00",{"id":224,"version":225,"summary_zh":226,"released_at":227},314807,"v0.16.0-rc.1","## Changelog\r\n\r\n- Add script to create release\r\n- Fix handling of device-discovery-strategy for GFD\r\n- Skip README updates for rc releases\r\n- Fix generate-changelog.sh script\r\n- Skip container updates if only CDI is selected\r\n- Allow cdi hook path to be set\r\n- Add nvidiaDevRoot config option\r\n- Detect devRoot for driver installation\r\n- Set \u002Fdev\u002Fshm size from \u002Fproc\u002Fmeminfo\r\n- Remove redundant version log\r\n- Remove provenance information from image manifests\r\n- add ngc image signing job for auto signing\r\n- fix: target should be binaries\r\n- Allow device discovery strategy to be specified\r\n- Refactor cdi handler construction\r\n- Add addMigMonitorDevices field to nvidia-device-plugin.options helper\r\n- Fix allPossibleMigStrategiesAreNone helm chart helper\r\n- use the helm quote function to wrap boolean values in quotes\r\n- Fix usage of hasConfigMap\r\n- Make info, nvml, and device lib construction explicit\r\n- Clean up construction of WSL devices\r\n- Remove unused function\r\n- Don't require node-name to be set if not needed\r\n- Make vgpu failures non-fatal\r\n- Use HasTegraFiles over IsTegraSystem\r\n- Raise error for MPS when using MIG\r\n- Align container driver root envvars\r\n- Update github.com\u002FNVIDIA\u002Fgo-nvml to v0.12.0-6\r\n- Add unit tests cases for sanitise func\r\n- Improving logic to sanitize GFD generated node labels\r\n- Add newline to pod logs\r\n- Adding vfio manager\r\n- Add prepare-release.sh script\r\n- Don't require node-name to be set if not needed\r\n- Remove GitLab pipeline .gitlab.yml\r\n- E2E test: fix object names\r\n- strip parentheses from the gpu product name\r\n- E2E test: instanciate a logger for helm outputs\r\n- E2E test: enhance logging via ginkgo\u002Fgomega\r\n- E2E test: remove e2elogs helper pkg\r\n- E2E test: 
Create HelmClient during Framework init\r\n- E2E test: Add -ginkgo.v flag to increase verbosity\r\n- E2E test: Create DiagnosticsCollector\r\n- Update vendoring\r\n- Replace go-nvlib\u002Fpkg\u002Fnvml with go-nvml\u002Fpkg\u002Fnvml\r\n- Add dependabot updates for release-0.15\r\n","2024-06-18T15:02:15",{"id":229,"version":230,"summary_zh":231,"released_at":232},314808,"v0.15.0","The NVIDIA GPU Device Plugin v0.15.0 release includes the following major changes:\r\n\r\n### Consolidated the NVIDIA GPU Device Plugin and NVIDIA GPU Feature Discovery repositories\r\nSince the NVIDIA GPU Device Plugin and GPU Feature Discovery (GFD) components are often used together, we have consolidated the repositories. The primary goal was to streamline the development and release process and functionality remains unchanged. The user facing changes are as follows:\r\n* The two components will use the same version, meaning that the GFD version jumps from `v0.8.2` to `v0.15.0`.\r\n* The two components use the same container image, meaning that instead of `nvcr.io\u002Fnvidia\u002Fgpu-feature-discovery` is to be used `nvcr.io\u002Fnvidia\u002Fk8s-device-plugin`. Note that this may mean that the `gpu-feature-discovery` command needs to be explicitly specified.\r\n\r\nIn order to facilitate the transition for users that rely on a standalone GFD deployment, this release includes a `gpu-feature-discovery` helm chart in the device plugin helm repository.\r\n\r\n### Added **experimental** support for GPU partitioning using MPS.\r\n\r\nThis release of the NVIDIA GPU Device Plugin includes experiemental support for GPU sharing using [CUDA MPS](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Ftree\u002Fv0.15.0?tab=readme-ov-file#with-cuda-mps). Feedback on this feature is appreciated.\r\n\r\n**This functionality is not production ready and includes a number of known issues including:**\r\n* The device plugin may show as started before it is ready to allocate shared GPUs while waiting for the CUDA MPS control daemon to come online.\r\n* There is no synchronization between the CUDA MPS control daemon and the GPU Device Plugin under restarts or configuration changes. This means that workloads may crash if they lose access to shared resources controlled by the CUDA MPS control daemon.\r\n* MPS is only supported for full GPUs.\r\n* It is not possible to \"combine\" MPS GPU requests to allow for access to more memory by a single container.\r\n\r\n## Deprecation Notice\r\n\r\nThe following table shows a set of new CUDA driver and runtime version labels and their existing equivalents. 
The existing labels should be considered deprecated and will be removed in a future release.\r\n\r\n| New Label | Deprecated Label |\r\n|---|---|\r\n| `nvidia.com\u002Fcuda.driver-version.major` | `nvidia.com\u002Fcuda.driver.major` |\r\n| `nvidia.com\u002Fcuda.driver-version.minor` | `nvidia.com\u002Fcuda.driver.minor` |\r\n| `nvidia.com\u002Fcuda.driver-version.revision` | `nvidia.com\u002Fcuda.driver.rev` |\r\n| `nvidia.com\u002Fcuda.driver-version.full` |    |\r\n| `nvidia.com\u002Fcuda.runtime-version.major` | `nvidia.com\u002Fcuda.runtime.major` |\r\n| `nvidia.com\u002Fcuda.runtime-version.minor` | `nvidia.com\u002Fcuda.runtime.minor` |\r\n| `nvidia.com\u002Fcuda.runtime-version.full` |    |\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.14.0...v0.15.0\r\n\r\n## Changes since v0.15.0-rc.2\r\n- Moved the `nvidia-device-plugin.yml` static deployment from the root of the repository to `deployments\u002Fstatic\u002Fnvidia-device-plugin.yml`.\r\n- Simplify PCI device classes in NFD worker configuration.\r\n- Update CUDA base image version to 12.4.1.\r\n- Switch to an Ubuntu 22.04-based CUDA image as the default image.\r\n- Add new CUDA driver and runtime version labels to align with other NFD version labels.\r\n- Update NFD dependency to v0.15.3.\r\n\r\n## v0.15.0-rc.2\r\n- Bump CUDA base image version to 12.3.2\r\n- Add `cdi-cri` device list strategy. This uses the CDIDevices CRI field to request CDI devices instead of annotations.\r\n- Set MPS memory limit by device index and not device UUID. This is a workaround for an issue where\r\n  these limits are not applied for devices if set by UUID.\r\n- Update MPS sharing to disallow requests for multiple devices if MPS sharing is configured.\r\n- Set MPS device memory limit by index.\r\n- Explicitly set `sharing.mps.failRequestsGreaterThanOne = true`.\r\n- Run `tail -f` for each MPS daemon to output logs.\r\n- Enforce replica limits for MPS sharing.\r\n\r\n## v0.15.0-rc.1\r\n- Import GPU Feature Discovery into the GPU Device Plugin repo. This means that the same version and container image is used for both components.\r\n- Add tooling to create a kind cluster for local development and testing.\r\n- Update `go-gpuallocator` dependency to migrate away from the deprecated `gpu-monitoring-tools` NVML bindings.\r\n- Remove `legacyDaemonsetAPI` config option. This was only required for k8s versions \u003C 1.16.\r\n- Add support for MPS sharing.\r\n- Bump CUDA base image version to 12.3.1\r\n","2024-04-17T12:22:16",{"id":234,"version":235,"summary_zh":236,"released_at":237},314809,"v0.15.0-rc.2","## What's Changed\r\n- Bump CUDA base image version to 12.3.2\r\n- Add `cdi-cri` device list strategy. This uses the CDIDevices CRI field to request CDI devices instead of annotations.\r\n- Set MPS memory limit by device index and not device UUID. This is a workaround for an issue where\r\n  these limits are not applied for devices if set by UUID.\r\n- Update MPS sharing to disallow requests for multiple devices if MPS sharing is configured.\r\n- Set MPS device memory limit by index.\r\n- Explicitly set `sharing.mps.failRequestsGreaterThanOne = true`.\r\n- Run `tail -f` for each MPS daemon to output logs.\r\n- Enforce replica limits for MPS sharing.\r\n\r\n","2024-03-18T11:48:36",{"id":239,"version":240,"summary_zh":241,"released_at":242},314810,"v0.14.5","## What's Changed\r\n* Update the nvidia-container-toolkit go dependency. 
This fixes a bug in CDI spec generation on systems where `lib -> usr\u002Flib` symlinks exist.\r\n* Update the CUDA base images to 12.3.2\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.14.4...v0.14.5","2024-02-29T10:23:59",{"id":244,"version":245,"summary_zh":246,"released_at":247},314811,"v0.15.0-rc.1","## What's Changed\r\n- Import GPU Feature Discovery into the GPU Device Plugin repo. This means that the same version and container image is used for both components.\r\n- Add tooling to create a kind cluster for local development and testing.\r\n- Update `go-gpuallocator` dependency to migrate away from the deprecated `gpu-monitoring-tools` NVML bindings.\r\n- Remove `legacyDaemonsetAPI` config option. This was only required for k8s versions \u003C 1.16.\r\n- Add support for MPS sharing.\r\n- Bump CUDA base image version to 12.3.1\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.14.0...v0.15.0-rc.1","2024-02-26T13:59:09",{"id":249,"version":250,"summary_zh":251,"released_at":252},314812,"v0.14.4","## What's Changed\r\n\r\n- Update to refactored go-gpuallocator code. This permanently fixes the `NVML_NVLINK_MAX_LINKS` value addressed in a\r\n  hotfix in v0.14.3. This also addresses a bug due to uninitialized NVML when calling go-gpuallocator.\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fk8s-device-plugin\u002Fcompare\u002Fv0.14.3...v0.14.4","2024-01-29T14:42:56"]