[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Langhalsdino--Kubernetes-GPU-Guide":3,"tool-Langhalsdino--Kubernetes-GPU-Guide":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160015,2,"2026-04-18T11:30:52",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":91,"forks":92,"last_commit_at":93,"license":94,"difficulty_score":95,"env_os":96,"env_gpu":97,"env_ram":98,"env_deps":99,"category_tags":107,"github_topics":108,"view_count":32,"oss_zip_url":80,"oss_zip_packed_at":80,"status":17,"created_at":120,"updated_at":121,"faqs":122,"releases":150},9161,"Langhalsdino\u002FKubernetes-GPU-Guide","Kubernetes-GPU-Guide","This guide should help fellow researchers and hobbyists to easily automate and accelerate there deep leaning training with their own Kubernetes GPU cluster.","Kubernetes-GPU-Guide 是一份专为研究人员和深度学习爱好者打造的开源实战指南，旨在帮助用户轻松搭建并自动化管理基于 Kubernetes 的 GPU 集群。它主要解决了深度学习工作流中常见的痛点：在本地设计算法后，将其迁移至云端进行大规模训练时往往耗时漫长、配置繁琐且容易出错。通过这份指南，用户可以将复杂的集群部署过程简化，显著加速模型训练迭代。\n\n该指南特别适合拥有多台 Ubuntu 服务器的开发者、AI 研究员以及希望自建算力基础设施的技术爱好者。其核心亮点在于提供了一套完整的自动化方案，包括实用的 Shell 脚本和预配置的 YAML 
文件，能够一键完成从节点初始化到 GPU 容器构建的全流程设置。架构上，它倡导采用\"CPU 主控节点 + 多 GPU 工作节点”的高效模式，既降低了主节点成本，又最大化了并行计算能力。尽管技术环境更新迅速，但该项目提供了清晰的结构概述和详细的手动\u002F自动双模式安装指令，是构建私有高性能深度学习平台的宝贵参考资源。","# How to automate deep learning training with Kubernetes GPU-cluster\n\nThis guide should help fellow researchers and hobbyists to easily automate and accelerate their deep learning training with their own Kubernetes GPU cluster.\u003C\u002Fbr>\nTo that end, I will explain how to easily set up a GPU cluster on multiple Ubuntu 16.04 bare metal servers and provide some useful scripts and .yaml files that do the entire setup for you.\n\nBy the way: If you need a Kubernetes GPU-cluster for other reasons, this guide might be helpful to you as well.\n\n**Why did I write this guide?**\u003C\u002Fbr>\nI worked as an intern for the startup [understand.ai](https:\u002F\u002Funderstand.ai) and noticed the hassle of first designing a machine learning algorithm locally and then bringing it to the cloud for training with different parameters and datasets.\u003C\u002Fbr>\nThe second part, bringing it to the cloud for extensive training, always takes longer than expected, is frustrating and usually involves a lot of pitfalls.\n\nFor this reason I decided to work on this problem and make the second part effortless, easy and quick.\u003C\u002Fbr>\nThe result of this work is this handy guide, which describes how everyone can set up their own Kubernetes GPU cluster to accelerate their work.\n\n**The new process for deep learning researchers:**\u003C\u002Fbr>\nAutomated deep learning training with a Kubernetes GPU-cluster significantly improves the process of bringing your algorithm to the cloud for training.\n\nThis illustration visualizes the new workflow, which involves only two simple steps:\u003C\u002Fbr>\n![My inspiration for the project, designed by Langhalsdino.](resources\u002Fdescription.jpg?raw=true \"My inspiration for the project\")\n\n**Disclaimer**\u003C\u002Fbr>\nBe aware that the following sections 
might be opinionated. Kubernetes is an evolving, fast-paced environment, which means this guide will probably be outdated at times, depending on the author's spare time and individual contributions. For this reason, contributions are highly appreciated.\n\n## Table of Contents\n\n  * [Quick Kubernetes revive](#quick-kubernetes-revive)\n  * [Rough overview on the structure of the cluster](#rough-overview-on-the-structure-of-the-cluster)\n  * [Initiate nodes](#initiate-nodes)\n    - [Constraints of my setup](#constraints-of-my-setup)\n    - [Setup instructions](#setup-instructions)\n        - [Use fast setup script](#fast-track---setup-script)\n        - [Manually step by step instructions](#detailed-step-by-step-instructions)\n  * [How to build your GPU container](#how-to-build-your-gpu-container)\n  * [Some helpful commands](#some-helpful-commands)\n  * [Acknowledgements](#acknowledgements)\n  * [License](#license)\n\n## Quick Kubernetes revive\n\n**These articles might be helpful if you need to refresh your Kubernetes knowledge:**\n\n  * [Introduction to Kubernetes by DigitalOcean](https:\u002F\u002Fwww.digitalocean.com\u002Fcommunity\u002Ftutorials\u002Fan-introduction-to-kubernetes)\n  * [Kubernetes concepts](https:\u002F\u002Fkubernetes.io\u002Fdocs\u002Fconcepts\u002F)\n  * [Kubernetes by example](http:\u002F\u002Fkubernetesbyexample.com\u002F)\n  * [Kubernetes basics - interactive tutorial](https:\u002F\u002Fkubernetes.io\u002Fdocs\u002Ftutorials\u002Fkubernetes-basics\u002F)\n\n## Rough overview on the structure of the cluster\nThe main idea is to have a small CPU-only master node that controls a cluster of GPU worker nodes.\n![Rough overview on the structure of the cluster, designed by Langhalsdino](resources\u002FSystem-overview.jpg?raw=true \"Rough overview\")\n\n## Initiate nodes\nBefore we can use the cluster, we first have to initialize it. 
\u003C\u002Fbr>\nTherefore each node has to be manually initialized and joined to the cluster.\n\n### Constraints of my setup\nThese are the constraints of my setup. In some places I have been stricter than necessary, but this is my setup and it worked for me 😒  \n\n**Master**\n\n+ Ubuntu 16.04\n+ SSH access with sudo user\n+ Internet access\n+ ufw deactivated (not recommended, but for ease of use)\n+ Enabled Ports (udp and tcp)\n    - 6443, 443, 8080\n    - 30000-32767 (only if your apps need them)\n    - These will be used to access services from outside of the cluster\n\n**Worker**\n\n+ Ubuntu 16.04\n+ SSH access with sudo user\n+ Internet access\n+ ufw deactivated (not recommended, but for ease of use)\n+ Enabled Ports (udp and tcp)\n    - 6443, 443\n\n### Setup instructions\nThese instructions cover my experience on Ubuntu 16.04 and may or may not transfer to other operating systems.\n\nI have created two scripts that fully initialize the master and worker nodes as described below. If you want to take the fast track, just use them. Otherwise, I recommend reading the step-by-step instructions.\n\n\u003Ch4>Fast Track - Setup script\u003C\u002Fh4>\nOK, let's take the fast track. 
Copy the corresponding scripts to your master and workers.\u003C\u002Fbr>\nFurthermore, make sure that your setup meets my constraints.\n\n**MASTER NODE**\n\nExecute the initialization script and remember the token 😉 \u003Cbr\u002F>\nThe token will look like this: ```--token f38242.e7f3XXXXXXXXe231e```.\n\n```\nchmod +x init-master.sh\nsudo .\u002Finit-master.sh \u003CIP-of-master>\n```\n\n**WORKER NODE**\n\nExecute the initialization script with the correct token and IP of your master.\u003Cbr\u002F>\nThe port is usually ```6443```.\n\n```\nchmod +x init-worker.sh\nsudo .\u002Finit-worker.sh \u003CToken-of-Master> \u003CIP-of-master>:\u003CPort>\n```\n\n\u003Ch4>Detailed step by step instructions\u003C\u002Fh4>\n\n**MASTER NODE**\n\n**1.** Add the Kubernetes repository to the package manager\n```\nsudo su -\napt-get update && apt-get install -y apt-transport-https\ncurl -s https:\u002F\u002Fpackages.cloud.google.com\u002Fapt\u002Fdoc\u002Fapt-key.gpg | apt-key add -\ncat \u003C\u003CEOF >\u002Fetc\u002Fapt\u002Fsources.list.d\u002Fkubernetes.list\ndeb http:\u002F\u002Fapt.kubernetes.io\u002F kubernetes-xenial main\nEOF\napt-get update\nexit\n```\n\n**2.** Install docker-engine, kubelet, kubeadm, kubectl, kubernetes-cni\n\n```\nsudo apt-get install -y docker-engine\nsudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni\nsudo groupadd docker\nsudo usermod -aG docker $USER\necho 'You might need to reboot \u002F relogin to make docker work correctly'\n```\n\n**3.** Since we want to build a cluster that uses GPUs, we need to enable GPU acceleration on the master node.\nKeep in mind that this instruction may become obsolete or change completely in a later version of Kubernetes!\n\n**3.I**\nAdd GPU support to the kubeadm configuration while the cluster is not yet initialized.\n```\nsudo vim \u002Fetc\u002Fsystemd\u002Fsystem\u002Fkubelet.service.d\u002F\u003C\u003CNumber>>-kubeadm.conf\n```\nAppend the flag ```--feature-gates=\"Accelerators=true\"``` to ExecStart, so it will look 
like this:\n```\nExecStart=\u002Fusr\u002Fbin\u002Fkubelet $KUBELET_KUBECONFIG_ARGS [...] --feature-gates=\"Accelerators=true\"\n```\n\n**3.II** Restart kubelet\n```\nsudo systemctl daemon-reload\nsudo systemctl restart kubelet\n```\n\n**4.** Now we will initialize the master node.\u003Cbr\u002F>\nFor this you will need the IP of your master node.\nFurthermore, this step will provide you with the credentials to add further worker nodes, so remember your token 😉 \u003C\u002Fbr>\nThe token will look like this: ``` --token f38242.e7f3XXXXXXXXe231e 130.211.XXX.XXX:6443```\n```\nsudo kubeadm init --apiserver-advertise-address=\u003Cip-address>\n```\n**5.** Since Kubernetes 1.6 switched from ABAC to RBAC for access control, we need to make the admin credentials available to the current user.\nYou will need to perform this step each time you log into the machine!!\n```\nsudo cp \u002Fetc\u002Fkubernetes\u002Fadmin.conf $HOME\u002F\nsudo chown $(id -u):$(id -g) $HOME\u002Fadmin.conf\nexport KUBECONFIG=$HOME\u002Fadmin.conf\n```\n\n**6.** Install a network add-on so that your pods can communicate with each other. 
Kubernetes 1.6 has some requirements for the network add-on, including:\n\n + CNI-based networks\n + RBAC support\n\nThis GoogleSheet contains a selection of suitable network add-ons: GoogleSheet-Network-Add-on-vergleich.\nI will use Weave Net from Weaveworks, just because of my personal preference ;)\n```\nkubectl apply -f https:\u002F\u002Fgit.io\u002Fweave-kube-1.6\n```\n**6.II** You are ready to go, maybe check your pods to confirm that everything is working ;)\n```\nkubectl get pods --all-namespaces\n```\n**N.** If you want to tear down your master, you will need to reset the master node\n```\nsudo kubeadm reset\n```\n\n**WORKER NODE**\n\nThe first steps should be familiar to you, which makes this process a lot faster ;)\n\n**1.** Add the Kubernetes repository to the package manager\n```\nsudo su -\napt-get update && apt-get install -y apt-transport-https\ncurl -s https:\u002F\u002Fpackages.cloud.google.com\u002Fapt\u002Fdoc\u002Fapt-key.gpg | apt-key add -\ncat \u003C\u003CEOF >\u002Fetc\u002Fapt\u002Fsources.list.d\u002Fkubernetes.list\ndeb http:\u002F\u002Fapt.kubernetes.io\u002F kubernetes-xenial main\nEOF\napt-get update\nexit\n```\n\n**2.** Install docker-engine, kubelet, kubeadm, kubectl, kubernetes-cni\n\n```\nsudo apt-get install -y docker-engine\nsudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni\nsudo groupadd docker\nsudo usermod -aG docker $USER\necho 'You might need to reboot \u002F relogin to make docker work correctly'\n```\n\n**3.** Since we want to build a cluster that uses GPUs, we need to enable GPU acceleration on the worker nodes that have a GPU installed.\nKeep in mind that this instruction may become obsolete or change completely in a later version of Kubernetes!\n\n**3.I**\nAdd GPU support to the kubeadm configuration while the cluster is not yet initialized.\n```\nsudo vim \u002Fetc\u002Fsystemd\u002Fsystem\u002Fkubelet.service.d\u002F\u003C\u003CNumber>>-kubeadm.conf\n```\nAppend the flag ```--feature-gates=\"Accelerators=true\"``` to ExecStart, so it will 
look like this:\n```\nExecStart=\u002Fusr\u002Fbin\u002Fkubelet $KUBELET_KUBECONFIG_ARGS [...] --feature-gates=\"Accelerators=true\"\n```\n\n**3.II** Restart kubelet\n```\nsudo systemctl daemon-reload\nsudo systemctl restart kubelet\n```\n\n**4.** Now we will add the worker to the cluster.\u003Cbr\u002F>\nFor this you will need the token from your master node, so take a deep dive into your notes xD\n```\nsudo kubeadm join --token f38242.e7f3XXXXXXe231e 130.211.XXX.XXX:6443\n```\n**5.** Finished, check your nodes on your master and see if everything worked.\n```\nkubectl get nodes\n```\n**N.** If you want to tear down your worker node, you will need to remove the node from the cluster and reset the worker node.\n***On master:***   \n```\nkubectl delete node \u003Cworker-node-name>\n```\n***On worker node***\n```     \nsudo kubeadm reset\n```\n\n**Client**\n\nIn order to control your cluster, e.g. your master, from your client, you will need to authenticate your client with the right user.\nThis guide won’t cover creating a separate user for the client; we will just copy the user from the master node.\u003Cbr\u002F>\nThis will be easier, trust me 🤓 \u003C\u002Fbr>\n[Instructions for adding a custom user will be added in the future]\n\n**1.** Install kubectl on your client. I have only tested it on my Mac, but Linux should work as well.\nI don’t know about Windows, but who cares about Windows anyway :D\u003C\u002Fbr>\n**On Mac**\n```\nbrew install kubectl\n```\n**2.** Copy the admin authentication from the master to your client\n```\nscp uai@130.211.XXX.64:~\u002Fadmin.conf ~\u002F.kube\u002F\n```\n**3.** Add the admin.conf configuration and credentials to the Kubernetes configuration. 
You will need to do this for every agent\n```\nexport KUBECONFIG=~\u002F.kube\u002Fadmin.conf\n```\nYou are ready to use kubectl on your local client.\n\n**3.II** You can test by listing all your pods\n```\nkubectl get pods --all-namespaces\n```\n\n\n**Install Kubernetes dashboard**\n\nThe Kubernetes dashboard is pretty beautiful and gives script kiddies like me access to a lot of functionality.\nIn order to use the dashboard you will need to get your client running, RBAC will enforce it 👮\n\n**You can perform these steps directly on the master or from your client**\n\n**1.** Check if the dashboard is already installed\n```\nkubectl get pods --all-namespaces | grep dashboard\n```\n\n**2.** If the dashboard isn’t installed, install it ;)\n```\nkubectl create -f https:\u002F\u002Fgit.io\u002Fkube-dashboard\n```\nIf this did not work, check if the container defined in the .yaml [git.io\u002Fkube-dashboard](https:\u002F\u002Fgit.io\u002Fkube-dashboard) exists. (This bug cost me a lot of time)\n\nIn order to have access to your dashboard you will need to be authenticated with your client.\n\n**3.** Proxy the dashboard to your client\n```\nkubectl proxy\n```\n\n**4.** Access the dashboard within your browser by visiting\n[127.0.0.1:8001\u002Fui](http:\u002F\u002F127.0.0.1:8001\u002Fui)\n\n## How to build your GPU container\nThis guide should help you to get a Docker container running that needs GPU access.\n\nFor this guide I have chosen to build an example Docker container that uses TensorFlow GPU binaries and can run TensorFlow programs in a Jupyter notebook.\n\nKeep in mind that this guide was written for Kubernetes 1.6, so later changes to Kubernetes may invalidate it.\n\n### Essential parts of .yml\nIn order to get your Nvidia GPU with CUDA running you have to pass the Nvidia driver and CUDA libraries to your container.\nSo we will use hostPath to make them available to the Kubernetes pod.\nThe actual paths differ from machine to machine, since they are set by your Nvidia driver and CUDA 
installation.\n```\nvolumes:\n    - hostPath:\n        path: \u002Fusr\u002Flib\u002Fnvidia-375\u002Fbin\n      name: bin\n    - hostPath:\n        path: \u002Fusr\u002Flib\u002Fnvidia-375\n      name: lib\n```\nMount the volumes with the driver and CUDA in the right directory for your container. These might differ, due to specific requirements of your container.\n```\nvolumeMounts:\n    - mountPath: \u002Fusr\u002Flocal\u002Fnvidia\u002Fbin\n      name: bin\n    - mountPath: \u002Fusr\u002Flocal\u002Fnvidia\u002Flib\n      name: lib\n```\nSince you need to tell Kubernetes how many GPUs you need, you can define that requirement here.\n```\nresources:\n    limits:\n        alpha.kubernetes.io\u002Fnvidia-gpu: 1\n```\nThat's it, this is everything you need to build your Kubernetes 1.6 container 😏\n\nOne note at the end that sums up my overall experience:\u003Cbr\u002F>\n**Kubernetes + Docker + Machine Learning + GPUs = Pure awesomeness**\n\n### Example GPU deployment\nMy example-gpu-deployment.yaml file describes two parts, a deployment and a service, since I want to make the Jupyter notebook available from the outside.\n\nRun kubectl create to deploy it and make it available to the outside\n```\nkubectl create -f deployment.yaml\n```\n\nThe deployment.yaml file looks like this:\n```\n---\napiVersion: extensions\u002Fv1beta1\nkind: Deployment\nmetadata:\n  name: tf-jupyter\nspec:\n  replicas: 1\n  template:\n    metadata:\n      labels:\n        app: tf-jupyter\n    spec:\n      volumes:\n      - hostPath:\n          path: \u002Fusr\u002Flib\u002Fnvidia-375\u002Fbin\n        name: bin\n      - hostPath:\n          path: \u002Fusr\u002Flib\u002Fnvidia-375\n        name: lib\n      containers:\n      - name: tensorflow\n        image: tensorflow\u002Ftensorflow:0.11.0rc0-gpu\n        ports:\n        - containerPort: 8888\n        resources:\n          limits:\n            alpha.kubernetes.io\u002Fnvidia-gpu: 1\n        volumeMounts:\n        - mountPath: 
\u002Fusr\u002Flocal\u002Fnvidia\u002Fbin\n          name: bin\n        - mountPath: \u002Fusr\u002Flocal\u002Fnvidia\u002Flib\n          name: lib\n---\napiVersion: v1\nkind: Service\nmetadata:\n  name: tf-jupyter-service\n  labels:\n    app: tf-jupyter\nspec:\n  selector:\n    app: tf-jupyter\n  ports:\n  - port: 8888\n    protocol: TCP\n    nodePort: 30061\n  type: LoadBalancer\n---\n```\n\n## Some helpful commands\n\n**Get commands** with basic output\n```\nkubectl get services                 # List all services in the namespace\nkubectl get pods --all-namespaces    # List all pods in all namespaces\nkubectl get pods -o wide             # List all pods in the namespace, with more details\nkubectl get deployment my-dep        # List a particular deployment\n```\n\n**Describe commands** with verbose output\n```\nkubectl describe nodes \u003Cnode-name>\nkubectl describe pods \u003Cpod-name>\n```\n\n**Deleting Resources**\n```\nkubectl delete -f .\u002Fpod.yaml                   # Delete a pod using the type and name specified in pod.yaml\nkubectl delete pod,service baz foo             # Delete pods and services with same names \"baz\" and \"foo\"\nkubectl delete pods,services -l name=\u003CLabel>   # Delete pods and services with label name=\u003CLabel>\nkubectl -n \u003Cnamespace> delete po,svc --all     # Delete all pods and services in namespace \u003Cnamespace>\n```\n\n**Get into the bash console** of one of your pods:\n\n```\nkubectl exec -it \u003Cpod-name> -- \u002Fbin\u002Fbash\n```\n\n## Common issues\n\nSome people contacted me about issues with their CUDA deployment related to the forwarding of drivers.\u003Cbr>\nIf the example-gpu-deployment.yaml is not working for you, I recommend installing CUDA as described in more detail in this guide [Installing TensorFlow on Ubuntu](http:\u002F\u002Fsimonboehm.com\u002Ftech\u002F2017\u002F06\u002F23\u002FinstallingTensorFlow.html) and then trying the example-gpu-deployment-nvidia-375-82.yaml.\nIt might be necessary 
to adjust the version number in the yaml file.\n\nIf you encountered another issue, feel free to open an issue on github.\n\n## Acknowledgements\nThere are a lot of guides, github repositories, issues and people out there who helped me a lot. \u003C\u002Fbr>\nSo I want to thank everybody for their help.\u003C\u002Fbr>\nSpecially the Startup [understand.ai](http:\u002F\u002Funderstand.ai) for their support.\n\n### Authors\n\n* **Frederic Tausch** - *Initial work* - [Langhalsdino](https:\u002F\u002Fgithub.com\u002FLanghalsdino)\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details\n","# 如何使用 Kubernetes GPU 集群自动化深度学习训练\n\n本指南旨在帮助研究人员和爱好者轻松地利用自己的 Kubernetes GPU 集群来自动化并加速深度学习训练。\u003C\u002Fbr>\n因此，我将介绍如何在多台运行 Ubuntu 16.04 的物理服务器上轻松搭建一个 GPU 集群，并提供一些实用的脚本和 .yaml 文件，它们可以为你完成整个设置过程。\n\n顺便说一句：如果你因其他原因需要一个 Kubernetes GPU 集群，本指南也可能对你有所帮助。\n\n**我为什么编写这份指南？**\u003C\u002Fbr>\n我曾在初创公司 [understand.ai](https:\u002F\u002Funderstand.ai) 实习，期间发现了一个问题：先在本地设计好机器学习算法，然后再将其部署到云端进行不同参数和数据集的训练，这个过程非常繁琐。\u003C\u002Fbr>\n而将算法部署到云端进行大规模训练的部分，总是比预期花费更多时间，令人沮丧，而且常常会遇到许多陷阱。\n\n基于此，我决定解决这个问题，让第二步变得毫不费力、简单快捷。\u003C\u002Fbr>\n最终成果就是这份便捷的指南，它详细介绍了如何搭建属于你自己的 Kubernetes GPU 集群，以加速你的工作。\n\n**面向深度学习研究人员的新流程：**\u003C\u002Fbr>\n通过 Kubernetes GPU 集群实现深度学习训练的自动化，可以显著改善将算法部署到云端进行训练的流程。\n\n这张图展示了新的工作流程，只需两个简单的步骤：\u003C\u002Fbr>\n![该项目的灵感来源，由 Langhalsdino 设计。](resources\u002Fdescription.jpg?raw=true \"该项目的灵感来源\")\n\n**免责声明**\u003C\u002Fbr>\n请注意，以下内容可能带有主观性。Kubernetes 是一个快速发展的环境，这意味着本指南可能会随着时间推移而过时，具体取决于作者的空闲时间和社区贡献者的参与情况。因此，我们非常欢迎各位的贡献。\n\n## 目录\n\n  * [Kubernetes 快速回顾](#quick-kubernetes-revive)\n  * [集群结构概览](#rough-overview-on-the-structure-of-the-cluster)\n  * [初始化节点](#initiate-nodes)\n    - [我的设置限制](#constraints-of-my-setup)\n    - [设置步骤](#setup-instructions)\n        - [使用快速设置脚本](#fast-track---setup-script)\n        - [手动详细步骤](#detailed-step-by-step-instructions)\n  * [如何构建 GPU 容器](#how-to-build-your-gpu-container)\n  * 
[一些实用命令](#some-helpful-commands)\n  * [致谢](#acknowledgements)\n  * [许可证](#license)\n\n## Kubernetes 快速回顾\n\n**如果你需要复习 Kubernetes 知识，以下文章可能会有所帮助：**\n\n  * [DigitalOcean 的 Kubernetes 入门教程](https:\u002F\u002Fwww.digitalocean.com\u002Fcommunity\u002Ftutorials\u002Fan-introduction-to-kubernetes)\n  * [Kubernetes 核心概念](https:\u002F\u002Fkubernetes.io\u002Fdocs\u002Fconcepts\u002F)\n  * [Kubernetes 示例教程](http:\u002F\u002Fkubernetesbyexample.com\u002F)\n  * [Kubernetes 基础交互式教程](https:\u002F\u002Fkubernetes.io\u002Fdocs\u002Ftutorials\u002Fkubernetes-basics\u002F)\n\n## 集群结构概览\n核心思想是拥有一台仅包含 CPU 的主节点，用于控制一组 GPU 工作节点。\n![集群结构概览，由 Langhalsdino 设计](resources\u002FSystem-overview.jpg?raw=true \"概览图\")\n\n## 初始化节点\n在开始使用集群之前，必须先对集群进行初始化。\u003C\u002Fbr>\n为此，每个节点都需要手动初始化并加入集群。\n\n### 我的设置限制\n以下是我在设置过程中的一些约束条件。有些地方可能比我实际需要的更严格，但这正是我的配置方式，而且对我有效 😒  \n\n**主节点**\n\n+ Ubuntu 16.04\n+ 具有 sudo 用户权限的 SSH 访问\n+ 互联网连接\n+ ufw 已禁用（不推荐，但为了方便操作）\n+ 开放端口（UDP 和 TCP）\n    - 6443、443、8080\n    - 30000–32767（仅当你的应用需要时）\n    - 这些端口将用于从集群外部访问服务\n\n**工作节点**\n\n+ Ubuntu 16.04\n+ 具有 sudo 用户权限的 SSH 访问\n+ 互联网连接\n+ ufw 已禁用（不推荐，但为了方便操作）\n+ 开放端口（UDP 和 TCP）\n    - 6443、443\n\n### 设置步骤\n这些说明基于我在 Ubuntu 16.04 上的经验，可能并不完全适用于其他操作系统。\n\n我已经创建了两个脚本，分别用于按照下文所述完全初始化主节点和工作节点。如果你想走捷径，可以直接使用这些脚本。否则，建议仔细阅读详细的逐步说明。\n\n\u003Ch4>捷径——设置脚本\u003C\u002Fh4>\n好的，让我们走捷径吧。将相应的脚本复制到你的主节点和工作节点上。\u003C\u002Fbr>\n此外，请确保你的设置符合我的限制条件。\n\n**主节点**\n\n执行初始化脚本，并记住生成的令牌 😉 \u003Cbr\u002F>\n令牌看起来像这样：```—token f38242.e7f3XXXXXXXXe231e```\n\n```\nchmod +x init-master.sh\nsudo .\u002Finit-master.sh \u003C主节点IP>\n```\n\n**工作节点**\n\n使用正确的令牌和主节点 IP 执行初始化脚本。\u003Cbr\u002F>\n通常使用的端口是 ```6443```。\n\n```\nchmod +x init-worker.sh\nsudo .\u002Finit-worker.sh \u003C主节点令牌> \u003C主节点IP>:\u003C端口>\n```\n\n\u003Ch4>详细步骤说明\u003C\u002Fh4>\n\n**主节点**\n\n**1.** 将 Kubernetes 仓库添加到包管理器\n```\nsudo su -\napt-get update && apt-get install -y apt-transport-https\ncurl -s https:\u002F\u002Fpackages.cloud.google.com\u002Fapt\u002Fdoc\u002Fapt-key.gpg | apt-key add -\ncat 
\u003C\u003CEOF >\u002Fetc\u002Fapt\u002Fsources.list.d\u002Fkubernetes.list\ndeb http:\u002F\u002Fapt.kubernetes.io\u002F kubernetes-xenial main\nEOF\napt-get update\nexit\n```\n\n**2.** 安装 docker-engine、kubeadm、kubectl 和 kubernetes-cni\n\n```\nsudo apt-get install -y docker-engine\nsudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni\nsudo groupadd docker\nsudo usermod -aG docker $USER\necho '你可能需要重启或重新登录才能使 Docker 正常工作'\n```\n\n**3.** 由于我们要构建一个使用 GPU 的集群，因此需要在主节点中启用 GPU 加速功能。\n请注意，这条指令在未来版本的 Kubernetes 中可能会失效或完全改变！\n\n**3.I**\n在集群尚未初始化的情况下，为 Kubeadm 配置添加 GPU 支持。\n```\nsudo vim \u002Fetc\u002Fsystemd\u002Fsystem\u002Fkubelet.service.d\u002F\u003C\u003CNumber>>-kubeadm.conf\n```\n在 ExecStart 行中添加标志 ```—feature-gates=\"Accelerators=true\"```, 修改后的行如下：\n```\nExecStart=\u002Fusr\u002Fbin\u002Fkubelet $KUBELET_KUBECONFIG_ARGS [...] --feature-gates=\"Accelerators=true\"\n```\n\n**3.II** 重启 kubelet\n```\nsudo systemctl daemon-reload\nsudo systemctl restart kubelet\n```\n\n**4.** 现在我们将初始化主节点。\u003Cbr\u002F>\n因此，您需要主节点的 IP 地址。\n此外，此步骤将为您提供用于添加更多工作节点的凭据，请记住您的令牌 😉 \u003C\u002Fbr>\n该令牌看起来如下：``` —token f38242.e7f3XXXXXXXXe231e 130.211.XXX.XXX:6443```\n```\nsudo kubeadm init --apiserver-advertise-address=\u003Cip-address>\n```\n**5.** 自 Kubernetes 1.6 从 ABAC 角色管理改为 RBAC 后，我们需要公开用户的凭据。\n每次登录机器时，您都需要执行此步骤！！\n```\nsudo cp \u002Fetc\u002Fkubernetes\u002Fadmin.conf $HOME\u002F\nsudo chown $(id -u):$(id -g) $HOME\u002Fadmin.conf\nexport KUBECONFIG=$HOME\u002Fadmin.conf\n```\n\n**6.** 安装网络插件，以便您的 Pod 可以相互通信。Kubernetes 1.6 对网络插件有一些要求，其中一些是：\n\n + 基于 CNI 的网络\n + 支持 RBAC\n\n此 Google 表格包含一些合适的网络插件选择：GoogleSheet-Network-Add-on-vergleich。\n我将使用 wave-works，只是因为个人偏好 ;)\n```\nkubectl apply -f https:\u002F\u002Fgit.io\u002Fweave-kube-1.6\n```\n**5.II** 您已经准备就绪，不妨检查一下您的 Pod，确认一切正常运作；)\n```\nkubectl get pods —all-namespaces\n```\n**N.** 如果您想拆除主节点，您需要重置主节点\n```\nsudo kubeadm reset\n```\n\n**工作节点**\n\n开始部分对您来说应该很熟悉，这将使整个过程快很多 ;)\n\n**1.** 将 Kubernetes 仓库添加到包管理器\n```\nsudo su -\napt-get 
update && apt-get install -y apt-transport-https\ncurl -s https:\u002F\u002Fpackages.cloud.google.com\u002Fapt\u002Fdoc\u002Fapt-key.gpg | apt-key add -\ncat \u003C\u003CEOF >\u002Fetc\u002Fapt\u002Fsources.list.d\u002Fkubernetes.list\ndeb http:\u002F\u002Fapt.kubernetes.io\u002F kubernetes-xenial main\nEOF\napt-get update\nexit\n```\n\n**2.** 安装 docker-engine、kubeadm、kubectl 和 kubernetes-cni\n```\nsudo apt-get install -y docker-engine\nsudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni\nsudo groupadd docker\nsudo usermod -aG docker $USER\necho '您可能需要重启或重新登录才能使 Docker 正常工作'\n```\n\n**3.** 由于我们希望构建一个使用 GPU 的集群，因此需要在安装了 GPU 的工作节点上启用 GPU 加速。\n请注意，此说明可能会在后续版本的 Kubernetes 中过时或完全改变！\n\n**3.I**\n在集群未初始化的情况下，为 Kubeadm 配置添加 GPU 支持。\n```\nsudo vim \u002Fetc\u002Fsystemd\u002Fsystem\u002Fkubelet.service.d\u002F\u003C\u003CNumber>>-kubeadm.conf\n```\n在 ExecStart 中添加标志 ```—feature-gates=\"Accelerators=true\"```, 使其看起来如下：\n```\nExecStart=\u002Fusr\u002Fbin\u002Fkubelet $KUBELET_KUBECONFIG_ARGS [...] 
--feature-gates=\"Accelerators=true\"\n```\n\n**3.II** 重启 kubelet\n```\nsudo systemctl daemon-reload\nsudo systemctl restart kubelet\n```\n\n**4.** 现在我们将把工作节点加入集群。\u003Cbr\u002F>\n因此，您需要记住主节点的令牌，请仔细查看您的笔记 xD\n```\nsudo kubeadm join --token f38242.e7f3XXXXXXe231e 130.211.XXX.XXX:6443\n```\n**5.** 完成！在主节点上检查您的节点，看看是否一切正常。\n```\nkubectl get nodes\n```\n**N.** 如果您想拆除工作节点，您需要先从集群中移除该节点，然后再重置工作节点。\n此外，最好也从集群中移除该工作节点。\u003Cbr\u002F>\n***在主节点上：***\n```\nkubectl delete node \u003C工作节点名称 >\n```\n***在工作节点上：***\n```     \nsudo kubeadm reset\n```\n\n**客户端**\n\n为了从客户端控制您的集群（例如主节点），您需要使用正确的用户身份验证客户端。\n本指南不会介绍如何为客户端创建单独的用户，我们将直接从主节点复制用户。\u003Cbr\u002F>\n这样会更简单，相信我 🤓 \u003C\u002Fbr>\n[未来将添加关于添加自定义用户的说明]\n\n**1.** 在您的客户端上安装 kubectl。我只在我的 Mac 上测试过，但 Linux 也应该可以运行。\n我不太清楚 Windows 的情况，不过谁会在乎 Windows 呢 :D\u003C\u002Fbr>\n**在 Mac 上**\n```\nbrew install kubectl\n```\n**2.** 将主节点上的管理员认证信息复制到您的客户端\n```\nscp uai@130.211.XXX.64:~\u002Fadmin.conf ~\u002F.kube\u002F\n```\n**3.** 将 admin.conf 配置和凭据添加到 Kubernetes 配置中。每个代理都需要执行此操作\n```\nexport KUBECONFIG=~\u002F.kube\u002Fadmin.conf\n```\n现在您就可以在本地客户端上使用 kubectl 了。\n\n**3.II** 您可以通过列出所有 Pod 来进行测试\n```\nkubectl get pods —all-namespaces\n```\n\n\n**安装 Kubernetes 仪表板**\n\nKubernetes 仪表板非常漂亮，它让我这个“脚本小子”也能访问许多功能。\n要使用仪表板，您需要让客户端正常运行，RBAC 将确保这一点 👮\n\n**您可以直接在主节点上或从客户端执行这些步骤**\n\n**1.** 检查仪表板是否已安装\nkubectl get pods --all-namespaces | grep dashboard\n\n**2.** 如果仪表板尚未安装，请安装它 ;) \n```\nkubectl create -f https:\u002F\u002Fgit.io\u002Fkube-dashboard\n```\n如果这一步没有成功，请检查 .yaml 文件 [git.io\u002Fkube-dashboard](https\u002Fgit.io\u002Fkube-dashboard) 中定义的容器是否存在。（这个错误曾让我浪费了很多时间）\n\n要访问您的仪表板，您需要通过客户端进行身份验证。\n\n**3.** 将仪表板代理到您的客户端\n```\nkubectl proxy\n```\n\n**4.** 在浏览器中访问仪表板：\n[127.0.0.1:8001\u002Fui](127.0.0.1:8001\u002Fui)\n\n\n\n## 如何构建 GPU 容器\n本指南旨在帮助您运行需要 GPU 访问权限的 Docker 容器。\n\n为此，我选择构建一个示例 Docker 容器，该容器使用 TensorFlow GPU 二进制文件，并可在 Jupyter Notebook 中运行 TensorFlow 程序。\n\n请注意，本指南是针对 Kubernetes 1.6 编写的，因此后续的更改可能会导致本指南失效。\n\n### .yml 文件的关键部分\n为了让搭载 CUDA 的 NVIDIA GPU 
up and running, you have to pass the NVIDIA driver and CUDA libraries into your container.\nSo we will use hostPath to mount these files into the Kubernetes Pod. The actual paths differ from machine to machine, since they depend on the NVIDIA driver and CUDA version you installed.\n```\nvolumes:\n    - hostPath:\n        path: \u002Fusr\u002Flib\u002Fnvidia-375\u002Fbin\n      name: bin\n    - hostPath:\n        path: \u002Fusr\u002Flib\u002Fnvidia-375\n      name: lib\n```\nMount the volumes with the driver and CUDA libraries into the right directories of your container. These paths may differ depending on what the container expects.\n```\nvolumeMounts:\n    - mountPath: \u002Fusr\u002Flocal\u002Fnvidia\u002Fbin\n      name: bin\n    - mountPath: \u002Fusr\u002Flocal\u002Fnvidia\u002Flib\n      name: lib\n```\nSince you have to tell Kubernetes how many GPUs you need, you define that requirement in the resource limits.\n```\nresources:\n    limits:\n        alpha.kubernetes.io\u002Fnvidia-gpu: 1\n```\nOkay, that is everything you need to build your Kubernetes 1.6 container 😏\n\nA note at the end that sums up my overall experience:\u003Cbr\u002F>\n**Kubernetes + Docker + Machine Learning + GPUs = pure awesomeness**\n\n### Example GPU deployment\nMy example-gpu-deployment.yaml file describes two parts, a Deployment and a Service, since I want the Jupyter notebook to be reachable from outside.\n\nRun kubectl create to make it available from outside:\n```\nkubectl create -f deployment.yaml\n```\n\nThe deployment.yaml file looks like this:\n```\n---\napiVersion: extensions\u002Fv1beta1\nkind: Deployment\nmetadata:\n  name: tf-jupyter\nspec:\n  replicas: 1\n  template:\n    metadata:\n      labels:\n        app: tf-jupyter\n    spec:\n      volumes:\n      - hostPath:\n          path: \u002Fusr\u002Flib\u002Fnvidia-375\u002Fbin\n        name: bin\n      - hostPath:\n          path: \u002Fusr\u002Flib\u002Fnvidia-375\n        name: lib\n      containers:\n      - name: tensorflow\n        image: tensorflow\u002Ftensorflow:0.11.0rc0-gpu\n        ports:\n        - containerPort: 8888\n        resources:\n          limits:\n            alpha.kubernetes.io\u002Fnvidia-gpu: 1\n        volumeMounts:\n        - mountPath: \u002Fusr\u002Flocal\u002Fnvidia\u002Fbin\n          name: bin\n        - mountPath: \u002Fusr\u002Flocal\u002Fnvidia\u002Flib\n          name: lib\n---\napiVersion: v1\nkind: Service\nmetadata:\n  name: tf-jupyter-service\n  labels:\n    app: tf-jupyter\nspec:\n  selector:\n    app: 
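tf-jupyter\n  ports:\n  - port: 8888\n    protocol: TCP\n    nodePort: 30061\n  type: LoadBalancer\n---\n```\n\n## Some helpful commands\n\n**Get commands** with basic output\n```\nkubectl get services                 # List all services in the namespace\nkubectl get pods --all-namespaces    # List all pods in all namespaces\nkubectl get pods -o wide             # List all pods in the namespace, with more details\nkubectl get deployment my-dep        # List a particular deployment\n```\n\n**Describe commands** with verbose output\n```\nkubectl describe nodes \u003Cnode-name>\nkubectl describe pods \u003Cpod-name>\n```\n\n**Deleting resources**\n```\nkubectl delete -f .\u002Fpod.yaml                   # Delete a pod using the type and name specified in pod.yaml\nkubectl delete pod,service baz foo             # Delete pods and services named \"baz\" and \"foo\"\nkubectl delete pods,services -l name=\u003CLabel>   # Delete pods and services with the label name=\u003CLabel>\nkubectl -n \u003Cnamespace> delete po,svc --all     # Delete all pods and services in the namespace \u003Cnamespace>\n```\n\n**Get into the bash console of one of your pods:**\n```\nkubectl exec -it \u003Cpod-name> -- \u002Fbin\u002Fbash\n```\n\n## Common issues\n\nSome people have contacted me to report problems with forwarding the drivers when deploying CUDA.\u003Cbr\u002F>\nIf example-gpu-deployment.yaml does not work for you, I suggest installing CUDA more carefully by following the guide [Installing TensorFlow on Ubuntu](http:\u002F\u002Fsimonboehm.com\u002Ftech\u002F2017\u002F06\u002F23\u002FinstallingTensorFlow.html), and then trying example-gpu-deployment-nvidia-375-82.yaml. You may need to adjust the version numbers in the YAML file.\n\nIf you run into other problems, feel free to open an issue on GitHub.\n\n## Acknowledgements\nMany guides, GitHub repositories, issues and people helped me a lot.\u003Cbr\u002F>\nThanks to everyone who helped.\u003Cbr\u002F>\nSpecial thanks to the startup [understand.ai](http:\u002F\u002Funderstand.ai) for their support.\n\n### Authors\n* **Frederic Tausch** - *Initial work* - [Langhalsdino](https:\u002F\u002Fgithub.com\u002FLanghalsdino)\n\n### License\nThis project is licensed under the MIT License; see the [LICENSE.md](LICENSE.md) file for details.","# Kubernetes GPU Cluster Quick Start Guide\n\nThis guide helps developers quickly set up a Kubernetes GPU cluster across multiple Ubuntu servers, in order to automate and speed up deep learning training.\n\n## Prerequisites\n\n### System requirements\n- **Operating system**: Ubuntu 16.04 (all nodes)\n- **Hardware**:\n  - **Master node**: CPU only; controls the cluster.\n  - **Worker nodes**: equipped with NVIDIA GPUs; run the actual compute jobs.\n- **Network access**:\n  - All nodes need internet access.\n  - All nodes need SSH access with a user that has `sudo` privileges.\n  - **Firewall**: consider disabling `ufw` temporarily to simplify setup (in production, open only the ports you need).\n    - 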
Ports to open on the master: `6443`, `443`, `8080`, `30000-32767`\n    - Ports to open on the workers: `6443`, `443`\n\n### Dependencies\nMake sure the basic tools are present on every node; the installation scripts used later take care of the following components automatically:\n- Docker Engine\n- Kubeadm, Kubectl, Kubelet\n- Kubernetes CNI plugins\n\n> **Note**: This guide follows the scripts and steps of the original guide and mainly targets Ubuntu 16.04. For faster downloads in mainland China, consider switching to the Aliyun or Tsinghua mirrors before running `apt-get`.\n\n## Installation\n\nYou can choose between the **fast track** and the **detailed step-by-step instructions**.\n\n### Option 1: Fast Track\n\nIf you have confirmed that your environment meets the requirements above, you can use the scripts shipped with the project.\n\n**1. Initialize the master node**\nRun the following on the master node (replace `\u003CIP-of-master>` with the IP of your master node):\n```bash\nchmod +x init-master.sh\nsudo .\u002Finit-master.sh \u003CIP-of-master>\n```\n*On success the terminal prints a token (for example `--token f38242.e7f3XXXXXXXXe231e`). Make sure to note down the token and the master IP.*\n\n**2. Join the worker nodes**\nRun the following on every worker node (replace `\u003CToken-of-Master>` and `\u003CIP-of-master>`; the port is 6443):\n```bash\nchmod +x init-worker.sh\nsudo .\u002Finit-worker.sh \u003CToken-of-Master> \u003CIP-of-master>:6443\n```\n\n---\n\n### Option 2: Detailed Step-by-Step\n\n#### A. Set up the master node\n\n**1. Add the Kubernetes package repository**\n```bash\nsudo su -\napt-get update && apt-get install -y apt-transport-https\ncurl -s https:\u002F\u002Fpackages.cloud.google.com\u002Fapt\u002Fdoc\u002Fapt-key.gpg | apt-key add -\ncat \u003C\u003CEOF >\u002Fetc\u002Fapt\u002Fsources.list.d\u002Fkubernetes.list\ndeb http:\u002F\u002Fapt.kubernetes.io\u002F kubernetes-xenial main\nEOF\napt-get update\nexit\n```\n\n**2. Install the core components**\n```bash\nsudo apt-get install -y docker-engine\nsudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni\nsudo groupadd docker\nsudo usermod -aG docker $USER\n# You may need to reboot or log in again for the docker group change to take effect\n```\n\n**3. Enable GPU support**\nEdit the kubelet configuration file (replace `\u003C\u003CNumber>>` in the file name with the actual number):\n```bash\nsudo vim \u002Fetc\u002Fsystemd\u002Fsystem\u002Fkubelet.service.d\u002F\u003C\u003CNumber>>-kubeadm.conf\n```\nAppend `--feature-gates=\"Accelerators=true\"` to the end of the `ExecStart` line, so that it looks like this:\n```text\nExecStart=\u002Fusr\u002Fbin\u002Fkubelet $KUBELET_KUBECONFIG_ARGS [...] 
--feature-gates=\"Accelerators=true\"\n```\nRestart the kubelet service:\n```bash\nsudo systemctl daemon-reload\nsudo systemctl restart kubelet\n```\n\n**4. Initialize the cluster**\n```bash\nsudo kubeadm init --apiserver-advertise-address=\u003Cip-address>\n```\n*Note down the token and the join command from the output.*\n\n**5. Set up admin access**\nCopy the admin configuration once, then repeat the `export` in every new login session so that kubectl picks it up:\n```bash\nsudo cp \u002Fetc\u002Fkubernetes\u002Fadmin.conf $HOME\u002F\nsudo chown $(id -u):$(id -g) $HOME\u002Fadmin.conf\nexport KUBECONFIG=$HOME\u002Fadmin.conf\n```\n\n**6. Install a network plugin (CNI)**\nWeave Net is used here as an example:\n```bash\nkubectl apply -f https:\u002F\u002Fgit.io\u002Fweave-kube-1.6\n```\nVerify the Pod status:\n```bash\nkubectl get pods --all-namespaces\n```\n\n#### B. Set up the worker nodes\n\n**1 & 2. Add the repository and install the components**\nSame as steps 1 and 2 for the master node.\n\n**3. Enable GPU support**\nSame as step 3 for the master node (edit the configuration file and restart kubelet).\n\n**4. Join the cluster**\nRun the join command with the token generated on the master node:\n```bash\nsudo kubeadm join --token f38242.e7f3XXXXXXe231e 130.211.XXX.XXX:6443\n```\n\n**5. Verify the nodes**\nBack on the **master node**, run:\n```bash\nkubectl get nodes\n```\nIf the nodes show the status `Ready`, the cluster is up.\n\n#### C. Set up a client (optional)\n\nTo manage the cluster from a local machine (Mac\u002FLinux):\n\n1. Install kubectl (Mac example):\n   ```bash\n   brew install kubectl\n   ```\n2. Copy the configuration file from the master:\n   ```bash\n   scp uai@130.211.XXX.64:~\u002Fadmin.conf ~\u002F.kube\u002F\n   ```\n3. Set the environment variable:\n   ```bash\n   export KUBECONFIG=~\u002F.kube\u002Fadmin.conf\n   ```\n4. Test the connection:\n   ```bash\n   kubectl get pods --all-namespaces\n   ```\n\n## Basic usage\n\n### 1. Check the cluster status\nConfirm that all nodes and Pods are running:\n```bash\nkubectl get nodes\nkubectl get pods --all-namespaces\n```\n\n### 2. Deploy the Kubernetes Dashboard (web UI)\nCheck whether it is already installed:\n```bash\nkubectl get pods --all-namespaces | grep dashboard\n```\nIf it is not installed, run:\n```bash\nkubectl create -f https:\u002F\u002Fgit.io\u002Fkube-dashboard\n```\nStart the proxy and open the dashboard in your browser:\n```bash\nkubectl proxy\n# The dashboard is usually served at: http:\u002F\u002Flocalhost:8001\u002Fapi\u002Fv1\u002Fnamespaces\u002Fkube-system\u002Fservices\u002Fkubernetes-dashboard\u002Fproxy\n```\n\n### 3. 
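Build and run GPU containers\nThe core goal of this guide is to run deep learning jobs. You need to build a Docker image that contains the CUDA environment and deploy it to the cluster with a YAML file.\n\nA simple deployment works roughly as follows (you write the `.yaml` file yourself):\n- Request GPU resources in the Pod definition.\n- Specify an image that contains a deep learning framework (such as TensorFlow\u002FPyTorch).\n- Mount data volumes for training.\n\nExample command to list the resources in the current namespace:\n```bash\nkubectl get all\n```\n\n> **Tip**: To reset nodes, run `sudo kubeadm reset` on the master; for a worker, first run `kubectl delete node \u003Cnode-name>` on the master and then run `sudo kubeadm reset` on the worker.","A research team in a university computer vision lab needs to train deep-learning object detection models for autonomous driving at scale on several self-hosted bare-metal Ubuntu servers.\n\n### Without Kubernetes-GPU-Guide\n- **Tedious, error-prone setup**: Researchers have to install the NVIDIA driver, Docker and the Kubernetes components by hand on every server; version incompatibilities often make the cluster setup fail and cost days.\n- **Chaotic resource scheduling**: Without unified GPU resource management, several people training on the same machine frequently run into GPU memory conflicts that abort jobs unexpectedly.\n- **Gap between local and cloud**: After an algorithm works locally, moving it to the server cluster for large-scale training means repeatedly editing configuration scripts, and the deployment is full of pitfalls and very inefficient.\n- **Poor scalability**: When new GPU nodes are needed for larger datasets, reconfiguring the network and joining the nodes to the cluster is so complex that it is hard to respond quickly to experimental needs.\n\n### With Kubernetes-GPU-Guide\n- **One-step automated deployment**: With the scripts and YAML files provided by the guide, the team can automatically initialize the whole environment, from the master node to multiple GPU worker nodes, in a short time, which lowers the barrier to entry considerably.\n- **Efficient resource isolation and scheduling**: With standard GPU containers, Kubernetes automatically dispatches training jobs to idle GPU nodes, which removes the contention for GPU memory and lets multiple jobs run stably in parallel.\n- **Smooth cloud training workflow**: Researchers only need to package their local algorithm as a container image to submit it to the cluster for hyperparameter search and training on large datasets, which removes the friction of moving from local to cloud.\n- **Easy scaling**: When servers are added, a simple join command brings them under unified cluster management, so growing compute demand is easy to support.\n\nKubernetes-GPU-Guide turns the painful, time-consuming process of building and operating a cluster into a standardized two-step workflow, so researchers can focus on algorithmic innovation rather than infrastructure maintenance.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLanghalsdino_Kubernetes-GPU-Guide_2a5f65b0.png","Langhalsdino","Frederic Tausch","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FLanghalsdino_2c003061.jpg","CTO @apic-ai 🐝  and kubernetes, computer vision and edge computing enthusiast.","apic.ai","Karlsruhe, Germany","github@tausch.me","langhalsdino",null,"https:\u002F\u002Fgithub.com\u002FLanghalsdino",[83,87],{"name":84,"color":85,"percentage":86},"Shell","#89e051",77.4,{"name":88,"color":89,"percentage":90},"Jupyter Notebook","#DA5B0B",22.6,816,112,"2026-03-05T11:08:43","MIT",5,"Linux (Ubuntu 16.04)","Requires NVIDIA GPUs (for the worker nodes) and an installed NVIDIA driver for GPU acceleration in Docker; specific models and memory size are not specified","Not specified",{"notes":100,"python":98,"dependencies":101},"This tool is for setting up a Kubernetes GPU cluster on bare-metal servers. Both master and worker nodes must run Ubuntu 16.04 and have sudo 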
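privileges. The default setup assumes the firewall (ufw) is disabled. macOS and Linux clients are supported for cluster management; Windows is untested. The 'Accelerators=true' feature gate has to be enabled manually in the kubelet configuration for GPU support.",[102,103,104,105,106],"docker-engine","kubeadm","kubectl","kubelet","kubernetes-cni",[14],[109,110,111,112,113,114,115,116,117,118,119],"kubernetes","kubernetes-cluster","kubernetes-setup","deep-learning","gpu-computing","distributed-systems","guide","kubernetes-gpu-cluster","cluster","gpu","worker-nodes","2026-03-27T02:49:30.150509","2026-04-19T03:03:16.461908",[123,128,132,136,141,145],{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},41137,"How do I correctly configure different versions of CUDA and cuDNN in Kubernetes?","The recommended approach is to use only one version of the driver, CUDA and cuDNN on a given machine and to mark that setup with labels. Then configure each job or deployment to request a specific label.\n\nIf you really need several versions on the same machine:\n1. Kubernetes currently only supports a single driver version per machine. Since CUDA is mostly backward compatible, you can install different versions.\n2. Install each version in a different location (for example \u002Fusr\u002Flocal\u002Fcuda-$V, \u002Fusr\u002Flocal\u002Fcuda-$V\u002Fcuda).\n3. Specify the version ($V) to use in each deployment.\n\nAnother solution is to use different CUDA\u002FcuDNN versions on different machines and label the machines by version; this has been verified to work.","https:\u002F\u002Fgithub.com\u002FLanghalsdino\u002FKubernetes-GPU-Guide\u002Fissues\u002F4",{"id":129,"question_zh":130,"answer_zh":131,"source_url":127},41138,"How is the LD_LIBRARY_PATH variable set inside the container, and is it safe to change it manually?","In the standard TensorFlow Docker image, LD_LIBRARY_PATH is usually set to a \u002Fusr\u002Flocal\u002Fcuda... 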
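path. In a Kubernetes GPU setup, however, it may be set to \u002Fusr\u002Flocal\u002Fnvidia\u002Flib:\u002Fusr\u002Flocal\u002Fnvidia\u002Flib64 in order to share the host's driver.\n\nManually setting LD_LIBRARY_PATH to \u002Fusr\u002Flocal\u002Fcuda..\u002Flib64 inside a running container may work temporarily, but it is not a recommended persistent solution. The right approach is to specify the library path explicitly in the deployment configuration (for example through environment variables or mounted volumes), or to follow the best practice of one version per machine plus labels, so that version conflicts are avoided.",{"id":133,"question_zh":134,"answer_zh":135,"source_url":127},41139,"Can containers with different CUDA versions run on the same node?","In principle yes, but with limitations. Kubernetes currently requires the driver version to be the same on a given machine. Since CUDA is usually backward compatible, you can install several versions of the CUDA Toolkit on the same machine (in different directories, such as \u002Fusr\u002Flocal\u002Fcuda-10.0, \u002Fusr\u002Flocal\u002Fcuda-11.0).\n\nTo make this work, you have to state explicitly in the deployment file which CUDA path the container should use. A more stable and recommended approach, however, is to use node labels to separate nodes with different CUDA versions and schedule jobs onto the nodes with the matching label, rather than forcing a mix on one node.",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},41140,"What should I do when kubeadm init fails with 'cannot automatically set CgroupDriver' and 'docker info' errors?","This error typically occurs on Ubuntu 18.04+ because kubeadm cannot detect Docker's cgroup driver automatically, or because the Docker version is too old to support the '--format' flag.\n\nPossible fixes:\n1. Make sure the installed Docker version is recent enough to support the 'docker info -f {{.CgroupDriver}}' command.\n2. Check the Docker service status and make sure it is running.\n3. If you use systemd, you may need to configure the kubelet cgroup driver explicitly so that it matches Docker's (usually both are systemd).\n4. The 'flag provided but not defined: --format' message in the error log indicates that the current docker client is too old; upgrade Docker.","https:\u002F\u002Fgithub.com\u002FLanghalsdino\u002FKubernetes-GPU-Guide\u002Fissues\u002F8",{"id":142,"question_zh":143,"answer_zh":144,"source_url":140},41141,"How do I fix 'running with swap on is not supported' during kubeadm init?","Kubernetes does not support running with swap enabled by default. If you hit this error during the preflight checks, you must disable swap.\n\nSteps:\n1. Disable swap temporarily: run 'sudo swapoff -a'.\n2. Disable swap permanently: edit '\u002Fetc\u002Ffstab' and comment out or delete the lines containing 'swap', so that it is not re-enabled after a reboot.\n3. 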
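Run 'kubeadm init' again.",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},41142,"What are the advantages of this project over Pipeline.io, and how do they differ?","Pipeline.io aims to provide a dedicated machine learning platform, while Kubernetes takes a more general approach to distributed computing that allows a higher degree of customization.\n\nIf you need distributed computing for other applications as well (not just ML), building one unified Kubernetes infrastructure may be easier and more sensible than maintaining two systems side by side (Kubernetes + Pipeline.io). That depends, of course, on how strong the need is. If you only need dedicated ML pipeline features, Pipeline.io may work better out of the box; if you need low-level flexibility and generality, Kubernetes is the better choice.","https:\u002F\u002Fgithub.com\u002FLanghalsdino\u002FKubernetes-GPU-Guide\u002Fissues\u002F3",[]]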