[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-ayaka14732--tpu-starter":3,"tool-ayaka14732--tpu-starter":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":81,"owner_twitter":75,"owner_website":82,"owner_url":83,"languages":84,"stars":93,"forks":94,"last_commit_at":95,"license":96,"difficulty_score":97,"env_os":98,"env_gpu":99,"env_ram":100,"env_deps":101,"category_tags":110,"github_topics":111,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":119,"updated_at":120,"faqs":121,"releases":145},217,"ayaka14732\u002Ftpu-starter","tpu-starter","Everything you want to know about Google Cloud TPU","tpu-starter 是一个面向开发者和研究人员的开源指南，旨在帮助用户快速上手 Google Cloud TPU（张量处理单元）。它系统性地解答了初学者常见的疑问，比如如何免费申请 TPU 资源、TPU VM 与 TPU Pod 的区别、如何配置开发环境、远程连接调试等，并提供了详细的实操步骤。项目还涵盖了 JAX 框架的最佳实践，包括随机数管理、数据类型转换、优化器使用等关键技巧，帮助用户高效发挥 TPU 的并行计算能力。特别值得一提的是，tpu-starter 不仅讲解基础用法，还整合了 TRC（TPU Research Cloud）免费计划的申请流程和多机共享存储、资源监控等进阶配置，大幅降低了 TPU 的使用门槛。无论是刚接触 TPU 的学生，还是希望规模化训练模型的研究人员，都能从中获得清晰、实用的指导。","# TPU Starter\n\n\u003Ch4 align=\"center\">\n    \u003Cp>\n        \u003Cb>English\u003C\u002Fb> |\n        \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fayaka14732\u002Ftpu-starter\u002Fblob\u002Fmain\u002FREADME_ko.md\">한국어\u003C\u002Fa> |\n        \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fayaka14732\u002Ftpu-starter\u002Fblob\u002Fmain\u002FREADME_zh.md\">中文\u003C\u002Fa>\n    \u003Cp>\n\u003C\u002Fh4>\n\nEverything you want to know about Google Cloud TPU\n\n* [1. Community](#1-community)\n* [2. Introduction to TPU](#2-introduction-to-tpu)\n    * [2.1. Why TPU?](#21-why-tpu)\n    * [2.2. How can I get free access to TPU?](#22-how-can-i-get-free-access-to-tpu)\n    * [2.3. If TPU is so good, why do I rarely see others using it?](#23-if-tpu-is-so-good-why-do-i-rarely-see-others-using-it)\n    * [2.4. I know TPU is great now. Can I touch a TPU?](#24-i-know-tpu-is-great-now-can-i-touch-a-tpu)\n    * [2.5. What does it mean to create a TPU instance? What do I actually get?](#25-what-does-it-mean-to-create-a-tpu-instance-what-do-i-actually-get)\n* [3. Introduction to the TRC Program](#3-introduction-to-the-trc-program)\n    * [3.1. How do I apply for the TRC program?](#31-how-do-i-apply-for-the-trc-program)\n    * [3.2. Is it really free?](#32-is-it-really-free)\n* [4. 
Using TPU VM](#4-using-tpu-vm)\n    * [4.1. Create a TPU VM](#41-create-a-tpu-vm)\n    * [4.2. Add an SSH public key to Google Cloud](#42-add-an-ssh-public-key-to-google-cloud)\n    * [4.3. SSH into TPU VM](#43-ssh-into-tpu-vm)\n    * [4.4. Verify that TPU VM has TPU](#44-verify-that-tpu-vm-has-tpu)\n    * [4.5. Setting up the development environment in TPU VM](#45-setting-up-the-development-environment-in-tpu-vm)\n    * [4.6. Verify JAX is working properly](#46-verify-jax-is-working-properly)\n    * [4.7. Using Byobu to ensure continuous program execution](#47-using-byobu-to-ensure-continuous-program-execution)\n    * [4.8. Configure VSCode Remote-SSH](#48-configure-vscode-remote-ssh)\n    * [4.9. Using Jupyter Notebook on TPU VM](#49-using-jupyter-notebook-on-tpu-vm)\n* [5. Using TPU Pod](#5-using-tpu-pod)\n    * [5.1. Create a subnet](#51-create-a-subnet)\n    * [5.2. Disable Cloud Logging](#52-disable-cloud-logging)\n    * [5.3. Create TPU Pod](#53-create-tpu-pod)\n    * [5.4. SSH into TPU Pod](#54-ssh-into-tpu-pod)\n    * [5.5. Modify the SSH configuration file on Host 0](#55-modify-the-ssh-configuration-file-on-host-0)\n    * [5.6. Add the SSH public key of Host 0 to all hosts](#56-add-the-ssh-public-key-of-host-0-to-all-hosts)\n    * [5.7. Configure the podrun command](#57-configure-the-podrun-command)\n    * [5.8. Configure NFS](#58-configure-nfs)\n    * [5.9. Setting up the development environment in TPU Pod](#59-setting-up-the-development-environment-in-tpu-pod)\n    * [5.10. Verify JAX is working properly](#510-verify-jax-is-working-properly)\n* [6. TPU Best Practices](#6-tpu-best-practices)\n    * [6.1. Prefer Google Cloud Platform to Google Colab](#61-prefer-google-cloud-platform-to-google-colab)\n    * [6.2. Prefer TPU VM to TPU node](#62-prefer-tpu-vm-to-tpu-node)\n* [7. JAX Best Practices](#7-jax-best-practices)\n    * [7.1. Import convention](#71-import-convention)\n    * [7.2. Manage random keys in JAX](#72-manage-random-keys-in-jax)\n    * [7.3. Conversion between NumPy arrays and JAX arrays](#73-conversion-between-numpy-arrays-and-jax-arrays)\n    * [7.4. Conversion between PyTorch tensors and JAX arrays](#74-conversion-between-pytorch-tensors-and-jax-arrays)\n    * [7.5. Get the shapes of all parameters in a nested dictionary](#75-get-the-shapes-of-all-parameters-in-a-nested-dictionary)\n    * [7.6. The correct way to generate random numbers on CPU](#76-the-correct-way-to-generate-random-numbers-on-cpu)\n    * [7.7. Use optimizers from Optax](#77-use-optimizers-from-optax)\n    * [7.8. Use the cross-entropy loss implementation from Optax](#78-use-the-cross-entropy-loss-implementation-from-optax)\n* [8. How Can I...](#8-how-can-i)\n    * [8.1. Share files across multiple TPU VM instances](#81-share-files-across-multiple-tpu-vm-instances)\n    * [8.2. Monitor TPU usage](#82-monitor-tpu-usage)\n    * [8.3. Start a server on TPU VM](#83-start-a-server-on-tpu-vm)\n    * [8.4. Run separate processes on different TPU cores](#84-run-separate-processes-on-different-tpu-cores)\n* [9. Common Gotchas](#9-common-gotchas)\n    * [9.1. TPU VMs will be rebooted occasionally](#91-tpu-vms-will-be-rebooted-occasionally)\n    * [9.2. One TPU core can only be used by one process at a time](#92-one-tpu-core-can-only-be-used-by-one-process-at-a-time)\n    * [9.3. TCMalloc breaks several programs](#93-tcmalloc-breaks-several-programs)\n    * [9.4. libtpu.so already in use by another process](#94-libtpuso-already-in-use-by-another-process)\n    * [9.5. 
JAX does not support the multiprocessing fork strategy](#95-jax-does-not-support-the-multiprocessing-fork-strategy)\n\n\u003C!-- Created by https:\u002F\u002Fgithub.com\u002Fekalinin\u002Fgithub-markdown-toc -->\n\nThis project was inspired by [Cloud Run FAQ](https:\u002F\u002Fgithub.com\u002Fahmetb\u002Fcloud-run-faq), a community-maintained knowledge base about another Google Cloud product.\n\n## 1. Community\n\nGoogle's [official Discord server](https:\u002F\u002Fdiscord.com\u002Finvite\u002Fgoogle-dev-community) has established the `#tpu-research-cloud` channel.\n\n## 2. Introduction to TPU\n\n### 2.1. Why TPU?\n\n**TL;DR**: TPU is to GPU as GPU is to CPU.\n\nTPU is hardware specifically designed for machine learning. For performance comparisons, see [Performance Comparison](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fblob\u002Fmain\u002Fexamples\u002Fflax\u002Flanguage-modeling\u002FREADME.md#runtime-evaluation) in Hugging Face Transformers:\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fayaka14732_tpu-starter_readme_8efb5865ab0b.png)\n\nMoreover, Google's [TRC program](https:\u002F\u002Fsites.research.google\u002Ftrc\u002Fabout\u002F) offers free TPU resources to researchers. If you've ever wondered what computing resources to use to train a model, you should try the TRC program, as it's the best option I know of. More information about the TRC program is provided below.\n\n### 2.2. How can I get free access to TPU?\n\nResearchers can apply to the [TRC program](https:\u002F\u002Fsites.research.google\u002Ftrc\u002Fabout\u002F) to obtain free TPU resources.\n\n### 2.3. If TPU is so good, why do I rarely see others using it?\n\nIf you want to use PyTorch, TPU may not be suitable for you. TPU is poorly supported by PyTorch. In one of my past experiments using PyTorch, a batch took 14 seconds on a CPU but required 4 hours on a TPU. Twitter user @mauricetpunkt also thinks that [PyTorch's performance on TPUs is bad](https:\u002F\u002Ftwitter.com\u002Fmauricetpunkt\u002Fstatus\u002F1506944350281945090).\n\nIn conclusion, if you want to do deep learning with TPU, you should use JAX as your deep learning framework. In fact, many popular deep learning libraries support JAX. For instance:\n\n- [Many models in Hugging Face Transformers support JAX](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Findex#supported-frameworks)\n- [Keras supports using JAX as a backend](https:\u002F\u002Fkeras.io\u002Fkeras_core\u002Fannouncement\u002F)\n- SkyPilot has [examples using Flax](https:\u002F\u002Fgithub.com\u002Fskypilot-org\u002Fskypilot\u002Fblob\u002Fmaster\u002Fexamples\u002Ftpu\u002Ftpuvm_mnist.yaml)\n\nFurthermore, JAX's design is very clean and has been widely appreciated. For instance, JAX is my favorite open-source project. I've tweeted about [how JAX is better than PyTorch](https:\u002F\u002Ftwitter.com\u002Fayaka14732\u002Fstatus\u002F1688194164033462272).\n\n### 2.4. I know TPU is great now. Can I touch a TPU?\n\nUnfortunately, we generally can't physically touch a real TPU. TPUs are meant to be accessed via Google Cloud services.\n\nIn some exhibitions, TPUs are [displayed for viewing](https:\u002F\u002Ftwitter.com\u002Fwalkforhours\u002Fstatus\u002F1696654844134822130), which might be the closest you can get to physically touching one.\n\nPerhaps only by becoming a Google Cloud Infrastructure Engineer can one truly feel the touch of a TPU.\n\n### 2.5. What does it mean to create a TPU instance? 
What do I actually get?\n\nAfter creating a TPU v3-8 instance on [Google Cloud Platform](https:\u002F\u002Fcloud.google.com\u002Ftpu), you'll get a cloud server running the Ubuntu system with sudo privileges, 96 CPU cores, 335 GiB memory, and a TPU device with 8 cores (totalling 128 GiB TPU memory).\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fayaka14732_tpu-starter_readme_621e9297edca.png)\n\nIn fact, this is similar to how we use GPUs. Typically, when we use a GPU, we are using a Linux server connected to the GPU. Similarly, when we use a TPU, we're using a server connected to the TPU.\n\n## 3. Introduction to the TRC Program\n\n### 3.1. How do I apply for the TRC program?\n\nApart from the TRC program's [homepage](https:\u002F\u002Fsites.research.google\u002Ftrc\u002Fabout\u002F), Shawn wrote a wonderful article about the TRC program on [google\u002Fjax#2108](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fjax\u002Fissues\u002F2108#issuecomment-866238579). Anyone who is interested in TPU should read it immediately.\n\n### 3.2. Is it really free?\n\nFor the first three months, the TRC program is completely free due to the free trial credit given when registering for Google Cloud. After three months, I spend roughly HK$13.95 (about US$1.78) per month. This expense is for the network traffic of the TPU server, while the TPU device itself is provided for free by the TRC program.\n\n## 4. Using TPU VM\n\n### 4.1. Create a TPU VM\n\nOpen [Google Cloud Platform](https:\u002F\u002Fcloud.google.com\u002Ftpu) and navigate to the [TPU Management Page](https:\u002F\u002Fconsole.cloud.google.com\u002Fcompute\u002Ftpus).\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fayaka14732_tpu-starter_readme_215cc4221795.png)\n\nClick the console button on the top-right corner to activate Cloud Shell.\n\nIn Cloud Shell, type the following command to create a Cloud TPU v3-8 VM:\n\n```sh\nuntil gcloud alpha compute tpus tpu-vm create node-1 --project tpu-develop --zone europe-west4-a --accelerator-type v3-8 --version tpu-vm-base ; do : ; done\n```\n\nHere, `node-1` is the name of the TPU VM you want to create, and `--project` is the name of your Google Cloud project.\n\nThe above command will repeatedly attempt to create the TPU VM until it succeeds.\n\n### 4.2. Add an SSH public key to Google Cloud\n\nFor Google Cloud's servers, if you want to SSH into them, using `ssh-copy-id` is the wrong approach. The correct method is:\n\nFirst, type “SSH keys” into the Google Cloud webpage search box, go to the relevant page, then click edit, and add your computer's SSH public key.\n\nTo view your computer's SSH public key:\n\n```sh\ncat ~\u002F.ssh\u002Fid_rsa.pub\n```\n\nIf you haven't created an SSH key pair yet, use the following command to create one, then execute the above command to view:\n\n```sh\nssh-keygen -t rsa -f ~\u002F.ssh\u002Fid_rsa -N \"\"\n```\n\nWhen adding an SSH public key to Google Cloud, it's crucial to pay special attention to the value of the username. In the SSH public key string, the part preceding the `@` symbol at the end is the username. When added to Google Cloud, it will create a user with that name on all servers for the current project. For instance, with the string `ayaka@instance-1`, Google Cloud will create a user named `ayaka` on the server. If you wish for Google Cloud to create a different username, you can manually modify this string. Changing the mentioned string to `nixie@instance-1` would lead Google Cloud to create a user named `nixie`. 
Moreover, making such changes won't affect the functionality of the SSH key.\n\n### 4.3. SSH into TPU VM\n\nCreate or edit your computer's `~\u002F.ssh\u002Fconfig`:\n\n```sh\nnano ~\u002F.ssh\u002Fconfig\n```\n\nAdd the following content:\n\n```\nHost tpuv3-8-1\n    User nixie\n    Hostname 34.141.220.156\n```\n\nHere, `tpuv3-8-1` is an arbitrary name, `User` is the username created in Google Cloud from the previous step, and `Hostname` is the IP address of the TPU VM.\n\nThen, on your own computer, use the following command to SSH into the TPU VM:\n\n```sh\nssh tpuv3-8-1\n```\n\nWhere `tpuv3-8-1` is the name set in `~\u002F.ssh\u002Fconfig`.\n\n### 4.4. Verify that TPU VM has TPU\n\n```sh\nls \u002Fdev\u002Faccel*\n```\n\nIf the following output appears:\n\n```\n\u002Fdev\u002Faccel0  \u002Fdev\u002Faccel1  \u002Fdev\u002Faccel2  \u002Fdev\u002Faccel3\n```\n\nThis indicates that the TPU VM indeed has a TPU.\n\n### 4.5. Setting up the development environment in TPU VM\n\nUpdate software packages:\n\n```sh\nsudo apt-get update -y -qq\nsudo apt-get upgrade -y -qq\nsudo apt-get install -y -qq golang neofetch zsh byobu\n```\n\nInstall the latest Python 3.12:\n\n```sh\nsudo apt-get install -y -qq software-properties-common\nsudo add-apt-repository -y ppa:deadsnakes\u002Fppa\nsudo apt-get install -y -qq python3.12-full python3.12-dev\n```\n\nInstall Oh My Zsh:\n\n```sh\nsh -c \"$(curl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Fohmyzsh\u002Fohmyzsh\u002Fmaster\u002Ftools\u002Finstall.sh)\" \"\" --unattended\nsudo chsh $USER -s \u002Fusr\u002Fbin\u002Fzsh\n```\n\nCreate a virtual environment (venv):\n\n```sh\npython3.12 -m venv ~\u002Fvenv\n```\n\nActivate the venv:\n\n```sh\n. ~\u002Fvenv\u002Fbin\u002Factivate\n```\n\nInstall JAX in the venv:\n\n```sh\npip install -U pip\npip install -U wheel\npip install -U \"jax[tpu]\" -f https:\u002F\u002Fstorage.googleapis.com\u002Fjax-releases\u002Flibtpu_releases.html\n```\n\n### 4.6. Verify JAX is working properly\n\nAfter activating the venv, use the following command to verify JAX is working:\n\n```sh\npython -c 'import jax; print(jax.devices())'\n```\n\nIf the output contains `TpuDevice`, this means JAX is working as expected.\n\n### 4.7. Using Byobu to ensure continuous program execution\n\nMany tutorials use the method of appending `&` to commands to run them in the background, so they continue executing even after exiting SSH. However, this is a basic method. The correct approach is to use a window manager like Byobu.\n\nTo run Byobu, simply use the `byobu` command. Then, execute commands within the opened window. To close the window, you can forcefully close the current window on your computer. Byobu will continue running on the server. The next time you connect to the server, you can retrieve the previous window using the `byobu` command.\n\nByobu has many advanced features. You can learn them by watching the official video [Learn Byobu while listening to Mozart](https:\u002F\u002Fyoutu.be\u002FNawuGmcvKus).\n\n### 4.8. Configure VSCode Remote-SSH\n\nOpen VSCode, access the Extensions panel on the left, search and install Remote - SSH.\n\nPress \u003Ckbd>F1\u003C\u002Fkbd> to open the command palette. Type ssh, click \"Remote-SSH: Connect to Host...\", then click on the server name set in `~\u002F.ssh\u002Fconfig` (e.g., `tpuv3-8-1`). 
Once VSCode completes the setup on the server, you can develop directly on the server with VSCode.\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fayaka14732_tpu-starter_readme_71ec2c3fe3d1.png)\n\nOn your computer, you can use the following command to quickly open a directory on the server:\n\n```sh\ncode --remote ssh-remote+tpuv3-8-1 \u002Fhome\u002Fayaka\u002Ftpu-starter\n```\n\nThis command will open the directory `\u002Fhome\u002Fayaka\u002Ftpu-starter` on `tpuv3-8-1` using VSCode.\n\n### 4.9. Using Jupyter Notebook on TPU VM\n\nAfter configuring VSCode with Remote-SSH, you can use Jupyter Notebook within VSCode. The result is as follows:\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fayaka14732_tpu-starter_readme_51815668f851.png)\n\nThere are two things to note here: First, in the top-right corner of the Jupyter Notebook interface, you should select the Kernel from `venv`, which refers to the `~\u002Fvenv\u002Fbin\u002Fpython` we created in the previous steps. Second, the first time you run it, you'll be prompted to install the Jupyter extension for VSCode and to install `ipykernel` within `venv`. You'll need to confirm these operations.\n\n## 5. Using TPU Pod\n\n### 5.1. Create a subnet\n\nTo create a TPU Pod, you first need to create a new VPC network and then create a subnet in the corresponding area of that network (e.g., `europe-west4-a`).\n\nTODO: Purpose?\n\n### 5.2. Disable Cloud Logging\n\nTODO: Reason? Steps?\n\n### 5.3. Create TPU Pod\n\nOpen Cloud Shell using the method described earlier for creating the TPU VM and use the following command to create a TPU v3-32 Pod:\n\n```sh\nuntil gcloud alpha compute tpus tpu-vm create node-1 --project tpu-advanced-research --zone europe-west4-a --accelerator-type v3-32 --version v2-alpha-pod --network advanced --subnetwork advanced-subnet-for-europe-west4 ; do : ; done\n```\n\nWhere `node-1` is the name you want for the TPU VM, `--project` is the name of your Google Cloud project, and `--network` and `--subnetwork` are the names of the network and subnet created in the previous step.\n\n### 5.4. SSH into TPU Pod\n\nSince the TPU Pod consists of multiple hosts, we need to choose one host, designate it as Host 0, and then SSH into Host 0 to execute commands. Given that the SSH public key added on the Google Cloud web page will be propagated to all hosts, every host can be directly connected through the SSH key, allowing us to designate any host as Host 0. The method to SSH into Host 0 is the same as for the aforementioned TPU VM.\n\n### 5.5. Modify the SSH configuration file on Host 0\n\nAfter SSH-ing into Host 0, the following configurations need to be made:\n\n```sh\nnano ~\u002F.ssh\u002Fconfig\n```\n\nAdd the following content:\n\n```\nHost 172.21.12.* 127.0.0.1\n    StrictHostKeyChecking no\n    UserKnownHostsFile \u002Fdev\u002Fnull\n    LogLevel ERROR\n```\n\nHere, `172.21.12.*` is determined by the IP address range of the subnet created in the previous steps. We use `172.21.12.*` because the IP address range specified when creating the subnet was 172.21.12.0\u002F24.\n\nWe need to do so because the `known_hosts` in ssh is created for preventing man-in-the-middle attacks. Since we are using an internal network environment here, we don't need to prevent such attacks or require this file, so we direct it to `\u002Fdev\u002Fnull`. 
Additionally, having `known_hosts` requires manually confirming the server's fingerprint during the first connection, which is unnecessary in an internal network environment and is not conducive to automation.\n\nThen, run the following command to modify the permissions of this configuration file. If the permissions are not modified, the configuration file will not take effect:\n\n```sh\nchmod 600 ~\u002F.ssh\u002Fconfig\n```\n\n### 5.6. Add the SSH public key of Host 0 to all hosts\n\nGenerate a key pair on Host 0:\n\n```sh\nssh-keygen -t rsa -f ~\u002F.ssh\u002Fid_rsa -N \"\"\n```\n\nView the generated SSH public key:\n\n```sh\ncat ~\u002F.ssh\u002Fid_rsa.pub\n```\n\nAdd this public key to the SSH keys in Google Cloud. This key will be automatically propagated to all hosts.\n\n### 5.7. Configure the `podrun` command\n\nThe `podrun` command is a tool under development. When executed on Host 0, it can run commands on all hosts via SSH.\n\nDownload `podrun`:\n\n```sh\nwget https:\u002F\u002Fraw.githubusercontent.com\u002Fayaka14732\u002Fllama-2-jax\u002F18e9625f7316271e4c0ad9dea233cfe23c400c9b\u002Fpodrun\nchmod +x podrun\n```\n\nEdit `~\u002Fpodips.txt` using:\n\n```sh\nnano ~\u002Fpodips.txt\n```\n\nSave the internal IP addresses of the other hosts in `~\u002Fpodips.txt`, one per line. For example:\n\n```sh\n172.21.12.86\n172.21.12.87\n172.21.12.83\n```\n\nA TPU v3-32 includes 4 hosts. Excluding Host 0, there are 3 more hosts. Hence, the `~\u002Fpodips.txt` for TPU v3-32 should contain 3 IP addresses.\n\nInstall Fabric using the system pip3:\n\n```sh\npip3 install fabric\n```\n\nUse `podrun` to make all hosts purr like a kitty:\n\n```sh\n.\u002Fpodrun -iw -- echo meow\n```\n\n### 5.8. Configure NFS\n\nInstall the NFS server and client:\n\n```sh\n.\u002Fpodrun -i -- sudo apt-get update -y -qq\n.\u002Fpodrun -i -- sudo apt-get upgrade -y -qq\n.\u002Fpodrun -- sudo apt-get install -y -qq nfs-common\nsudo apt-get install -y -qq nfs-kernel-server\nsudo mkdir -p \u002Fnfs_share\nsudo chown -R nobody:nogroup \u002Fnfs_share\nsudo chmod 777 \u002Fnfs_share\n```\n\nModify `\u002Fetc\u002Fexports`:\n\n```sh\nsudo nano \u002Fetc\u002Fexports\n```\n\nAdd:\n\n```\n\u002Fnfs_share  172.21.12.0\u002F24(rw,sync,no_subtree_check)\n```\n\nExecute:\n\n```sh\nsudo exportfs -a\nsudo systemctl restart nfs-kernel-server\n\n.\u002Fpodrun -- sudo mkdir -p \u002Fnfs_share\n.\u002Fpodrun -- sudo mount 172.21.12.2:\u002Fnfs_share \u002Fnfs_share\n.\u002Fpodrun -i -- ln -sf \u002Fnfs_share ~\u002Fnfs_share\n\ntouch ~\u002Fnfs_share\u002Fmeow\n.\u002Fpodrun -i -- ls -la ~\u002Fnfs_share\u002Fmeow\n```\n\nReplace `172.21.12.2` with the actual internal IP address of Host 0.\n\n### 5.9. Setting up the development environment in TPU Pod\n\nSave to `~\u002Fnfs_share\u002Fsetup.sh`:\n\n```sh\n#!\u002Fbin\u002Fbash\n\nexport DEBIAN_FRONTEND=noninteractive\n\nsudo apt-get update -y -qq\nsudo apt-get upgrade -y -qq\nsudo apt-get install -y -qq golang neofetch zsh byobu\n\nsudo apt-get install -y -qq software-properties-common\nsudo add-apt-repository -y ppa:deadsnakes\u002Fppa\nsudo apt-get install -y -qq python3.12-full python3.12-dev\n\nsh -c \"$(curl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Fohmyzsh\u002Fohmyzsh\u002Fmaster\u002Ftools\u002Finstall.sh)\" \"\" --unattended\nsudo chsh $USER -s \u002Fusr\u002Fbin\u002Fzsh\n\npython3.12 -m venv ~\u002Fvenv\n\n. 
~\u002Fvenv\u002Fbin\u002Factivate\n\npip install -U pip\npip install -U wheel\npip install -U \"jax[tpu]\" -f https:\u002F\u002Fstorage.googleapis.com\u002Fjax-releases\u002Flibtpu_releases.html\n```\n\nThen execute:\n\n```sh\nchmod +x ~\u002Fnfs_share\u002Fsetup.sh\n.\u002Fpodrun -i ~\u002Fnfs_share\u002Fsetup.sh\n```\n\n### 5.10. Verify JAX is working properly\n\n```sh\n.\u002Fpodrun -ic -- ~\u002Fvenv\u002Fbin\u002Fpython -c 'import jax; jax.distributed.initialize(); jax.process_index() == 0 and print(jax.devices())'\n```\n\nIf the output contains `TpuDevice`, this means JAX is working as expected.\n\n## 6. TPU Best Practices\n\n### 6.1. Prefer Google Cloud Platform to Google Colab\n\n[Google Colab](https:\u002F\u002Fcolab.research.google.com\u002F) only provides TPU v2-8 devices, while on [Google Cloud Platform](https:\u002F\u002Fcloud.google.com\u002Ftpu) you can select TPU v2-8 and TPU v3-8.\n\nBesides, on Google Colab you can only use TPU through the Jupyter Notebook interface. Even if you [log in to the Colab server via SSH](https:\u002F\u002Fayaka.shn.hk\u002Fcolab\u002F), you are inside a Docker image without root access. On Google Cloud Platform, however, you have full access to the TPU VM.\n\nIf you really want to use TPU on Google Colab, you need to run [the following script](01-basics\u002Fsetup_colab_tpu.py) to set up TPU:\n\n```python\nimport jax\nfrom jax.tools.colab_tpu import setup_tpu\n\nsetup_tpu()\n\ndevices = jax.devices()\nprint(devices)  # should print TpuDevice\n```\n\n### 6.2. Prefer TPU VM to TPU node\n\nWhen you are creating a TPU instance, you need to choose between TPU VM and TPU node. Always prefer TPU VM, because it is the newer architecture, in which TPU devices are connected to the host VM directly. This makes it easier to set up the TPU device.\n\n## 7. JAX Best Practices\n\n### 7.1. Import convention\n\nYou may see two different kinds of import conventions. One is to import `jax.numpy` as `np` and import the original NumPy as `onp`. The other is to import `jax.numpy` as `jnp` and leave the original NumPy as `np`.\n\nOn 16 Jan 2019, Colin Raffel wrote in [a blog article](https:\u002F\u002Fcolinraffel.com\u002Fblog\u002Fyou-don-t-know-jax.html) that the convention at that time was to import the original NumPy as `onp`.\n\nOn 5 Nov 2020, Niru Maheswaranathan said in [a tweet](https:\u002F\u002Ftwitter.com\u002Fniru_m\u002Fstatus\u002F1324078070546882560) that he thinks the convention at that time was to import `jax.numpy` as `jnp` and to leave the original NumPy as `np`.\n\nWe can conclude that the newer convention is to import `jax.numpy` as `jnp`.\n\n### 7.2. Manage random keys in JAX\n\nJAX manages random state explicitly: you hold a key and split off subkeys whenever you need fresh randomness. The regular way is this:\n\n```python\nimport jax.random as rand\n\nkey = rand.PRNGKey(42)  # the initial key\nkey, *subkey = rand.split(key, num=4)  # keep one key, get three subkeys\nprint(subkey[0])\nprint(subkey[1])\nprint(subkey[2])\n```\n\n
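A small illustration of why this matters (an addition for clarity, not from the original README): drawing twice from the same key reproduces the same numbers, so split off a fresh subkey before each draw.\n\n```python\nimport jax.random as rand\n\nkey = rand.PRNGKey(0)\na = rand.normal(key, (2,))\nb = rand.normal(key, (2,))  # identical to `a`: the key was reused\n\nkey, subkey = rand.split(key)\nc = rand.normal(subkey, (2,))  # fresh values from the new subkey\n```\n\n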
### 7.3. Conversion between NumPy arrays and JAX arrays\n\nUse [`np.asarray`](https:\u002F\u002Fjax.readthedocs.io\u002Fen\u002Flatest\u002F_autosummary\u002Fjax.numpy.asarray.html) and [`onp.asarray`](https:\u002F\u002Fnumpy.org\u002Fdoc\u002Fstable\u002Freference\u002Fgenerated\u002Fnumpy.asarray.html).\n\n```python\nimport jax.numpy as np\nimport numpy as onp\n\na = np.array([1, 2, 3])  # JAX array\nb = onp.asarray(a)  # converted to NumPy array\n\nc = onp.array([1, 2, 3])  # NumPy array\nd = np.asarray(c)  # converted to JAX array\n```\n\n### 7.4. Conversion between PyTorch tensors and JAX arrays\n\nConvert a PyTorch tensor to a JAX array:\n\n```python\nimport jax.numpy as np\nimport torch\n\na = torch.rand(2, 2)  # PyTorch tensor\nb = np.asarray(a.numpy())  # JAX array\n```\n\nConvert a JAX array to a PyTorch tensor:\n\n```python\nimport jax.numpy as np\nimport numpy as onp\nimport torch\n\na = np.zeros((2, 2))  # JAX array\nb = torch.from_numpy(onp.asarray(a))  # PyTorch tensor\n```\n\nThis will result in a warning:\n\n```\nUserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ..\u002Ftorch\u002Fcsrc\u002Futils\u002Ftensor_numpy.cpp:178.)\n```\n\nIf you need writable tensors, you can use `onp.array` instead of `onp.asarray` to make a copy of the original array.\n\n### 7.5. Get the shapes of all parameters in a nested dictionary\n\n```python\njax.tree_map(lambda x: x.shape, params)\n```\n\n### 7.6. The correct way to generate random numbers on CPU\n\nUse the [jax.default_device()](https:\u002F\u002Fjax.readthedocs.io\u002Fen\u002Flatest\u002F_autosummary\u002Fjax.default_device.html) context manager:\n\n```python\nimport jax\nimport jax.random as rand\n\ndevice_cpu = jax.devices('cpu')[0]\nwith jax.default_device(device_cpu):\n    key = rand.PRNGKey(42)\n    a = rand.poisson(key, 3, shape=(1000,))\n    print(a.device())  # TFRT_CPU_0\n```\n\nSee \u003Chttps:\u002F\u002Fgithub.com\u002Fgoogle\u002Fjax\u002Fdiscussions\u002F9691#discussioncomment-3650311>.\n\n### 7.7. Use optimizers from Optax\n\n
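This section is a bare heading in the original README. Below is a minimal sketch of the standard Optax pattern (the parameter pytree and loss function are illustrative placeholders, not from this project):\n\n```python\nimport jax\nimport jax.numpy as jnp\nimport optax\n\n# Illustrative parameters: any pytree of arrays works.\nparams = {'w': jnp.ones((3, 3)), 'b': jnp.zeros((3,))}\n\noptimizer = optax.adamw(learning_rate=1e-3)\nopt_state = optimizer.init(params)\n\ndef loss_fn(params, x):\n    return jnp.sum(params['w'] @ x + params['b'])\n\nx = jnp.ones((3,))\ngrads = jax.grad(loss_fn)(params, x)\n\n# `update` turns raw gradients into parameter updates; `adamw` takes the\n# current params so that weight decay can be applied.\nupdates, opt_state = optimizer.update(grads, opt_state, params)\nparams = optax.apply_updates(params, updates)\n```\n\nIn a real training loop you would usually wrap the gradient and update steps in a single function compiled with `@jax.jit`.\n\n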
### 7.8. Use the cross-entropy loss implementation from Optax\n\nFor classification with integer labels, use `optax.softmax_cross_entropy_with_integer_labels`.\n\n## 8. How Can I...\n\n### 8.1. Share files across multiple TPU VM instances\n\nTPU VM instances in the same zone are connected with internal IPs, so you can [create a shared file system using NFS](https:\u002F\u002Ftecadmin.net\u002Fhow-to-install-and-configure-an-nfs-server-on-ubuntu-20-04\u002F).\n\n### 8.2. Monitor TPU usage\n\nUse [jax-smi](https:\u002F\u002Fgithub.com\u002Fayaka14732\u002Fjax-smi).\n\n### 8.3. Start a server on TPU VM\n\nExample: TensorBoard\n\nAlthough every TPU VM is allocated a public IP, in most cases you should not expose a server to the Internet, because doing so is insecure. Instead, forward the port via SSH:\n\n```sh\nssh -C -N -L 127.0.0.1:6006:127.0.0.1:6006 tpu1\n```\n\n### 8.4. Run separate processes on different TPU cores\n\nSee \u003Chttps:\u002F\u002Fgist.github.com\u002Fskye\u002Ff82ba45d2445bb19d53545538754f9a3>.\n\n## 9. Common Gotchas\n\n### 9.1. TPU VMs will be rebooted occasionally\n\nAs of 24 Oct 2022, TPU VMs will be rebooted occasionally if there is a maintenance event.\n\nThe following things will happen:\n\n1. All the running processes will be terminated\n2. The external IP address will be changed\n\nWe can periodically save the model parameters, optimiser states and other useful data, so that training can easily be resumed after termination.\n\nWe should use the `gcloud` command to connect instead of connecting directly with SSH. If we have to use SSH (e.g. if we want to use VSCode, SSH is the only choice), we need to manually change the target IP address after a reboot.\n\n### 9.2. One TPU core can only be used by one process at a time\n\nSee also: §9.4.\n\nUnlike GPUs, you will get an error if you run two processes on the TPU at a time:\n\n```\nI0000 00:00:1648534265.148743  625905 tpu_initializer_helper.cc:94] libtpu.so already in use by another process. Run \"$ sudo lsof -w \u002Fdev\u002Faccel0\" to figure out which process is using the TPU. Not attempting to load libtpu.so in this process.\n```\n\n### 9.3. TCMalloc breaks several programs\n\n[TCMalloc](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Ftcmalloc) is Google's customized memory allocation library. On TPU VM, `LD_PRELOAD` is set to use TCMalloc by default:\n\n```sh\n$ echo $LD_PRELOAD\n\u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibtcmalloc.so.4\n```\n\nHowever, using TCMalloc in this manner may break several programs like gsutil:\n\n```sh\n$ gsutil --help\n\u002Fsnap\u002Fgoogle-cloud-sdk\u002F232\u002Fplatform\u002Fbundledpythonunix\u002Fbin\u002Fpython3: \u002Fsnap\u002Fgoogle-cloud-sdk\u002F232\u002Fplatform\u002Fbundledpythonunix\u002Fbin\u002F..\u002F..\u002F..\u002Flib\u002Fx86_64-linux-gnu\u002Flibm.so.6: version `GLIBC_2.29' not found (required by \u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibtcmalloc.so.4)\n```\n\nThe [homepage of TCMalloc](http:\u002F\u002Fgoog-perftools.sourceforge.net\u002Fdoc\u002Ftcmalloc.html) also indicates that `LD_PRELOAD` is tricky and this mode of usage is not recommended.\n\nIf you encounter problems related to TCMalloc, you can disable it in the current shell using the command:\n\n```sh\nunset LD_PRELOAD\n```\n\n### 9.4. `libtpu.so` already in use by another process\n\nIf you hit this error although none of your own jobs are running, the snippet below kills stale Python processes and removes the TPU lock file:\n\n```sh\n# Only kill the venv's Python if no Python process of ours is still running,\n# then remove the stale TPU lock file and logs.\nif ! pgrep -a -u $USER python ; then\n    killall -q -w -s SIGKILL ~\u002F.venv311\u002Fbin\u002Fpython\nfi\nrm -rf \u002Ftmp\u002Flibtpu_lockfile \u002Ftmp\u002Ftpu_logs\n```\n\nSee also \u003Chttps:\u002F\u002Fgithub.com\u002Fgoogle\u002Fjax\u002Fissues\u002F9220#issuecomment-1015940320>.\n\n### 9.5. JAX does not support the multiprocessing `fork` strategy\n\nUse the `spawn` or `forkserver` strategies instead.\n\nSee \u003Chttps:\u002F\u002Fgithub.com\u002Fgoogle\u002Fjax\u002Fissues\u002F1805#issuecomment-561244991>.\n","# TPU 入门指南\n\n\u003Ch4 align=\"center\">\n    \u003Cp>\n        \u003Cb>English\u003C\u002Fb> |\n        \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fayaka14732\u002Ftpu-starter\u002Fblob\u002Fmain\u002FREADME_ko.md\">한국어\u003C\u002Fa> |\n        \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fayaka14732\u002Ftpu-starter\u002Fblob\u002Fmain\u002FREADME_zh.md\">中文\u003C\u002Fa>\n    \u003Cp>\n\u003C\u002Fh4>\n\n关于 Google Cloud TPU 的一切你想知道的内容\n\n* [1. 社区](#1-社区)\n* [2. TPU 简介](#2-tpu-简介)\n    * [2.1. 为什么选择 TPU？](#21-为什么选择-tpu)\n    * [2.2. 如何免费获得 TPU 使用权限？](#22-如何免费获得-tpu-使用权限)\n    * [2.3. 如果 TPU 这么好，为什么很少看到别人使用？](#23-如果-tpu-这么好为什么很少看到别人使用)\n    * [2.4. 我现在知道 TPU 很棒了，我能实际接触一下 TPU 吗？](#24-我现在知道-tpu-很棒了我能实际接触一下-tpu-吗)\n    * [2.5. 创建 TPU 实例是什么意思？我实际上得到了什么？](#25-创建-tpu-实例是什么意思我实际上得到了什么)\n* [3. TRC 计划介绍](#3-trc-计划介绍)\n    * [3.1. 如何申请 TRC 计划？](#31-如何申请-trc-计划)\n    * [3.2. 真的是免费的吗？](#32-真的是免费的吗)\n* [4. 使用 TPU VM](#4-使用-tpu-vm)\n    * [4.1. 创建 TPU VM](#41-创建-tpu-vm)\n    * [4.2. 将 SSH 公钥添加到 Google Cloud](#42-将-ssh-公钥添加到-google-cloud)\n    * [4.3. SSH 登录 TPU VM](#43-ssh-登录-tpu-vm)\n    * [4.4. 验证 TPU VM 是否拥有 TPU](#44-验证-tpu-vm-是否拥有-tpu)\n    * [4.5. 在 TPU VM 中设置开发环境](#45-在-tpu-vm-中设置开发环境)\n    * [4.6. 验证 JAX 是否正常工作](#46-验证-jax-是否正常工作)\n    * [4.7. 使用 Byobu 确保程序持续运行](#47-使用-byobu-确保程序持续运行)\n    * [4.8. 配置 VSCode Remote-SSH](#48-配置-vscode-remote-ssh)\n    * [4.9. 
在 TPU VM 上使用 Jupyter Notebook](#49-在-tpu-vm-上使用-jupyter-notebook)\n* [5. 使用 TPU Pod](#5-使用-tpu-pod)\n    * [5.1. 创建子网](#51-创建子网)\n    * [5.2. 禁用 Cloud Logging](#52-禁用-cloud-logging)\n    * [5.3. 创建 TPU Pod](#53-创建-tpu-pod)\n    * [5.4. SSH 登录 TPU Pod](#54-ssh-登录-tpu-pod)\n    * [5.5. 修改 Host 0 上的 SSH 配置文件](#55-修改-host-0-上的-ssh-配置文件)\n    * [5.6. 将 Host 0 的 SSH 公钥添加到所有主机](#56-将-host-0-的-ssh-公钥添加到所有主机)\n    * [5.7. 配置 podrun 命令](#57-配置-podrun-命令)\n    * [5.8. 配置 NFS](#58-配置-nfs)\n    * [5.9. 在 TPU Pod 中设置开发环境](#59-在-tpu-pod-中设置开发环境)\n    * [5.10. 验证 JAX 是否正常工作](#510-验证-jax-是否正常工作)\n* [6. TPU 最佳实践](#6-tpu-最佳实践)\n    * [6.1. 优先选择 Google Cloud Platform 而非 Google Colab](#61-优先选择-google-cloud-platform-而非-google-colab)\n    * [6.2. 优先选择 TPU VM 而非 TPU node](#62-优先选择-tpu-vm-而非-tpu-node)\n* [7. JAX 最佳实践](#7-jax-最佳实践)\n    * [7.1. 导入约定](#71-导入约定)\n    * [7.2. 在 JAX 中管理随机数密钥](#72-在-jax-中管理随机数密钥)\n    * [7.3. NumPy 数组与 JAX 数组之间的转换](#73-numpy-数组与-jax-数组之间的转换)\n    * [7.4. PyTorch 张量与 JAX 数组之间的转换](#74-pytorch-张量与-jax-数组之间的转换)\n    * [7.5. 获取嵌套字典中所有参数的形状](#75-获取嵌套字典中所有参数的形状)\n    * [7.6. 在 CPU 上生成随机数的正确方式](#76-在-cpu-上生成随机数的正确方式)\n    * [7.7. 使用 Optax 中的优化器](#77-使用-optax-中的优化器)\n    * [7.8. 使用 Optax 中的交叉熵损失实现](#78-使用-optax-中的交叉熵损失实现)\n* [8. 如何...](#8-如何)\n    * [8.1. 在多个 TPU VM 实例之间共享文件](#81-在多个-tpu-vm-实例之间共享文件)\n    * [8.2. 监控 TPU 使用情况](#82-监控-tpu-使用情况)\n    * [8.3. 在 TPU VM 上启动服务器](#83-在-tpu-vm-上启动服务器)\n    * [8.4. 在不同的 TPU 核心上运行独立进程](#84-在不同的-tpu-核心上运行独立进程)\n* [9. 常见陷阱](#9-常见陷阱)\n    * [9.1. TPU VM 会偶尔重启](#91-tpu-vm-会偶尔重启)\n    * [9.2. 一个 TPU 核心一次只能被一个进程使用](#92-一个-tpu-核心一次只能被一个进程使用)\n    * [9.3. TCMalloc 会导致多个程序出错](#93-tcmalloc-会导致多个程序出错)\n    * [9.4. libtpu.so 已被另一个进程占用](#94-libtpuso-已被另一个进程占用)\n    * [9.5. JAX 不支持 multiprocessing 的 fork 策略](#95-jax-不支持-multiprocessing-的-fork-策略)\n\n\u003C!-- Created by https:\u002F\u002Fgithub.com\u002Fekalinin\u002Fgithub-markdown-toc -->\n\n本项目受 [Cloud Run FAQ](https:\u002F\u002Fgithub.com\u002Fahmetb\u002Fcloud-run-faq) 启发，后者是关于另一款 Google Cloud 产品的社区维护知识库。\n\n## 1. 社区\n\nGoogle 的[官方 Discord 服务器](https:\u002F\u002Fdiscord.com\u002Finvite\u002Fgoogle-dev-community)已设立 `#tpu-research-cloud` 频道。\n\n## 2. TPU 简介\n\n### 2.1. 为什么选择 TPU？\n\n**TL;DR（太长不看）**：TPU 之于 GPU，正如 GPU 之于 CPU。\n\nTPU 是专为机器学习设计的硬件。有关性能对比，请参阅 Hugging Face Transformers 中的 [Performance Comparison](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fblob\u002Fmain\u002Fexamples\u002Fflax\u002Flanguage-modeling\u002FREADME.md#runtime-evaluation)：\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fayaka14732_tpu-starter_readme_8efb5865ab0b.png)\n\n此外，Google 的 [TRC 计划](https:\u002F\u002Fsites.research.google\u002Ftrc\u002Fabout\u002F)向研究人员提供免费的 TPU 资源。如果你曾思考过该用什么计算资源来训练模型，你应该尝试 TRC 计划，因为这是我所知的最佳选择。下文将提供更多关于 TRC 计划的信息。\n\n### 2.2. 如何免费获得 TPU 使用权限？\n\n研究人员可以申请 [TRC 计划](https:\u002F\u002Fsites.research.google\u002Ftrc\u002Fabout\u002F)以获得免费的 TPU 资源。\n\n### 2.3. 
如果 TPU（张量处理单元）这么好，为什么我很少看到别人使用它？\n\n如果你打算使用 PyTorch，TPU 可能并不适合你。PyTorch 对 TPU 的支持很差。在我过去的一个使用 PyTorch 的实验中，一个 batch 在 CPU 上只需 14 秒，但在 TPU 上却需要 4 小时。Twitter 用户 @mauricetpunkt 也认为 [PyTorch 在 TPU 上的性能很差](https:\u002F\u002Ftwitter.com\u002Fmauricetpunkt\u002Fstatus\u002F1506944350281945090)。\n\n总之，如果你想用 TPU 进行深度学习，你应该选择 JAX 作为你的深度学习框架。事实上，许多流行的深度学习库都支持 JAX。例如：\n\n- [Hugging Face Transformers 中的许多模型支持 JAX](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Findex#supported-frameworks)\n- [Keras 支持将 JAX 作为后端](https:\u002F\u002Fkeras.io\u002Fkeras_core\u002Fannouncement\u002F)\n- SkyPilot 提供了 [使用 Flax 的示例](https:\u002F\u002Fgithub.com\u002Fskypilot-org\u002Fskypilot\u002Fblob\u002Fmaster\u002Fexamples\u002Ftpu\u002Ftpuvm_mnist.yaml)\n\n此外，JAX 的设计非常简洁，广受好评。例如，JAX 是我最喜欢的开源项目。我曾发推文讨论过 [JAX 为何优于 PyTorch](https:\u002F\u002Ftwitter.com\u002Fayaka14732\u002Fstatus\u002F1688194164033462272)。\n\n### 2.4. 我现在知道 TPU 很棒了。我能摸到一块 TPU 吗？\n\n不幸的是，我们通常无法实际触摸到真实的 TPU。TPU 是通过 Google Cloud 服务访问的。\n\n在一些展览中，TPU 会 [被展示出来供人观看](https:\u002F\u002Ftwitter.com\u002Fwalkforhours\u002Fstatus\u002F1696654844134822130)，这可能是你能最接近“亲手触摸”TPU 的方式了。\n\n也许只有成为 Google Cloud 基础设施工程师，才能真正感受到 TPU 的“触感”。\n\n### 2.5. 创建 TPU 实例是什么意思？我实际上得到了什么？\n\n在 [Google Cloud Platform](https:\u002F\u002Fcloud.google.com\u002Ftpu) 上创建一个 TPU v3-8 实例后，你会获得一台运行 Ubuntu 系统的云服务器，拥有 sudo 权限、96 个 CPU 核心、335 GiB 内存，以及一个包含 8 个核心的 TPU 设备（总计 128 GiB TPU 内存）。\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fayaka14732_tpu-starter_readme_621e9297edca.png)\n\n事实上，这与我们使用 GPU 的方式类似。通常，当我们使用 GPU 时，实际上是使用一台连接了 GPU 的 Linux 服务器。同样地，当我们使用 TPU 时，也是在使用一台连接了 TPU 的服务器。\n\n## 3. TRC 计划介绍\n\n### 3.1. 如何申请 TRC 计划？\n\n除了 TRC 计划的 [主页](https:\u002F\u002Fsites.research.google\u002Ftrc\u002Fabout\u002F) 外，Shawn 在 [google\u002Fjax#2108](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fjax\u002Fissues\u002F2108#issuecomment-866238579) 上撰写了一篇关于 TRC 计划的精彩文章。任何对 TPU 感兴趣的人都应立即阅读。\n\n### 3.2. 真的免费吗？\n\n在注册 Google Cloud 时，前三个月由于赠送的免费试用额度，TRC 计划完全免费。三个月后，我每月大约花费 13.95 港币（约合 1.78 美元）。这笔费用仅用于 TPU 服务器的网络流量，而 TPU 设备本身由 TRC 计划免费提供。\n\n## 4. 使用 TPU VM\n\n### 4.1. 创建 TPU VM\n\n打开 [Google Cloud Platform](https:\u002F\u002Fcloud.google.com\u002Ftpu)，进入 [TPU 管理页面](https:\u002F\u002Fconsole.cloud.google.com\u002Fcompute\u002Ftpus)。\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fayaka14732_tpu-starter_readme_215cc4221795.png)\n\n点击右上角的控制台按钮以激活 Cloud Shell。\n\n在 Cloud Shell 中，输入以下命令以创建一个 Cloud TPU v3-8 VM：\n\n```sh\nuntil gcloud alpha compute tpus tpu-vm create node-1 --project tpu-develop --zone europe-west4-a --accelerator-type v3-8 --version tpu-vm-base ; do : ; done\n```\n\n其中，`node-1` 是你要创建的 TPU VM 的名称，`--project` 是你的 Google Cloud 项目名称。\n\n上述命令会不断尝试创建 TPU VM，直到成功为止。\n\n### 4.2. 向 Google Cloud 添加 SSH 公钥\n\n对于 Google Cloud 的服务器，如果你想通过 SSH 登录，使用 `ssh-copy-id` 是错误的做法。正确的方法是：\n\n首先，在 Google Cloud 网页的搜索框中输入 “SSH keys”，进入相关页面，点击编辑，然后添加你电脑的 SSH 公钥。\n\n查看你电脑的 SSH 公钥：\n\n```sh\ncat ~\u002F.ssh\u002Fid_rsa.pub\n```\n\n如果你尚未创建 SSH 密钥对，可使用以下命令创建，然后再执行上述命令查看：\n\n```sh\nssh-keygen -t rsa -f ~\u002F.ssh\u002Fid_rsa -N \"\"\n```\n\n向 Google Cloud 添加 SSH 公钥时，需特别注意用户名的值。在 SSH 公钥字符串末尾 `@` 符号前的部分即为用户名。添加到 Google Cloud 后，它会在当前项目的所有服务器上创建一个该名称的用户。例如，对于字符串 `ayaka@instance-1`，Google Cloud 会在服务器上创建名为 `ayaka` 的用户。如果你希望 Google Cloud 创建不同的用户名，可以手动修改该字符串。例如，将字符串改为 `nixie@instance-1`，Google Cloud 就会创建名为 `nixie` 的用户。而且，这种修改不会影响 SSH 密钥的功能。\n\n### 4.3. 
SSH 登录 TPU VM\n\n在你本地电脑上创建或编辑 `~\u002F.ssh\u002Fconfig` 文件：\n\n```sh\nnano ~\u002F.ssh\u002Fconfig\n```\n\n添加以下内容：\n\n```\nHost tpuv3-8-1\n    User nixie\n    Hostname 34.141.220.156\n```\n\n其中，`tpuv3-8-1` 是任意名称，`User` 是上一步在 Google Cloud 中创建的用户名，`Hostname` 是 TPU VM 的 IP 地址。\n\n然后，在你本地电脑上使用以下命令 SSH 登录 TPU VM：\n\n```sh\nssh tpuv3-8-1\n```\n\n其中 `tpuv3-8-1` 是在 `~\u002F.ssh\u002Fconfig` 中设置的名称。\n\n### 4.4. 验证 TPU VM 是否拥有 TPU\n\n```sh\nls \u002Fdev\u002Faccel*\n```\n\n如果出现如下输出：\n\n```\n\u002Fdev\u002Faccel0  \u002Fdev\u002Faccel1  \u002Fdev\u002Faccel2  \u002Fdev\u002Faccel3\n```\n\n说明该 TPU VM 确实配备了 TPU。\n\n### 4.5. 在 TPU VM 中设置开发环境\n\n更新软件包：\n\n```sh\nsudo apt-get update -y -qq\nsudo apt-get upgrade -y -qq\nsudo apt-get install -y -qq golang neofetch zsh byobu\n```\n\n安装最新的 Python 3.12：\n\n```sh\nsudo apt-get install -y -qq software-properties-common\nsudo add-apt-repository -y ppa:deadsnakes\u002Fppa\nsudo apt-get install -y -qq python3.12-full python3.12-dev\n```\n\n安装 Oh My Zsh：\n\n```sh\nsh -c \"$(curl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Fohmyzsh\u002Fohmyzsh\u002Fmaster\u002Ftools\u002Finstall.sh)\" \"\" --unattended\nsudo chsh $USER -s \u002Fusr\u002Fbin\u002Fzsh\n```\n\n创建虚拟环境（venv）：\n\n```sh\npython3.12 -m venv ~\u002Fvenv\n```\n\n激活 venv：\n\n```sh\n. ~\u002Fvenv\u002Fbin\u002Factivate\n```\n\n在 venv 中安装 JAX：\n\n```sh\npip install -U pip\npip install -U wheel\npip install -U \"jax[tpu]\" -f https:\u002F\u002Fstorage.googleapis.com\u002Fjax-releases\u002Flibtpu_releases.html\n```\n\n### 4.6. 验证 JAX 是否正常工作\n\n激活 venv 后，使用以下命令验证 JAX 是否正常工作：\n\n```sh\npython -c 'import jax; print(jax.devices())'\n```\n\n如果输出中包含 `TpuDevice`，说明 JAX 已按预期正常工作。\n\n### 4.7. 使用 Byobu 确保程序持续运行\n\n许多教程使用在命令末尾附加 `&` 的方法，使其在后台运行，这样即使退出 SSH 后程序仍会继续执行。然而，这是一种基础方法。正确的方式是使用像 Byobu 这样的终端窗口管理器（terminal window manager）。\n\n要运行 Byobu，只需使用 `byobu` 命令。然后在打开的窗口中执行命令即可。若要关闭窗口，可以直接强制关闭你本地计算机上的当前窗口。Byobu 会在服务器上继续运行。下次连接到服务器时，再次使用 `byobu` 命令即可恢复之前的窗口。\n\nByobu 拥有许多高级功能。你可以通过观看官方视频 [边听莫扎特边学习 Byobu](https:\u002F\u002Fyoutu.be\u002FNawuGmcvKus) 来了解这些功能。\n\n### 4.8. 配置 VSCode Remote-SSH\n\n打开 VSCode，在左侧访问扩展（Extensions）面板，搜索并安装 Remote - SSH。\n\n按下 \u003Ckbd>F1\u003C\u002Fkbd> 打开命令面板，输入 ssh，点击 “Remote-SSH: Connect to Host...”，然后点击你在 `~\u002F.ssh\u002Fconfig` 中设置的服务器名称（例如 `tpuv3-8-1`）。一旦 VSCode 在服务器上完成设置，你就可以直接在服务器上使用 VSCode 进行开发。\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fayaka14732_tpu-starter_readme_71ec2c3fe3d1.png)\n\n在你的本地计算机上，可以使用以下命令快速打开服务器上的某个目录：\n\n```sh\ncode --remote ssh-remote+tpuv3-8-1 \u002Fhome\u002Fayaka\u002Ftpu-starter\n```\n\n该命令将使用 VSCode 打开 `tpuv3-8-1` 上的 `\u002Fhome\u002Fayaka\u002Ftpu-starter` 目录。\n\n### 4.9. 在 TPU VM 上使用 Jupyter Notebook\n\n在配置好 VSCode 的 Remote-SSH 后，你可以在 VSCode 内使用 Jupyter Notebook。效果如下所示：\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fayaka14732_tpu-starter_readme_51815668f851.png)\n\n这里需要注意两点：第一，在 Jupyter Notebook 界面的右上角，应选择来自 `venv` 的 Kernel，即我们在前面步骤中创建的 `~\u002Fvenv\u002Fbin\u002Fpython`。第二，首次运行时，系统会提示你为 VSCode 安装 Jupyter 扩展，并在 `venv` 中安装 `ipykernel`。你需要确认这些操作。\n\n## 5. 使用 TPU Pod\n\n### 5.1. 创建子网\n\n要创建 TPU Pod，首先需要创建一个新的 VPC 网络（Virtual Private Cloud network），然后在该网络对应的区域（例如 `europe-west4-a`）中创建一个子网（subnet）。\n\nTODO: Purpose?\n\n### 5.2. 禁用 Cloud Logging\n\nTODO: Reason? Steps?\n\n### 5.3. 
创建 TPU Pod\n\n使用之前创建 TPU VM 时描述的方法打开 Cloud Shell，并使用以下命令创建一个 TPU v3-32 Pod：\n\n```sh\nuntil gcloud alpha compute tpus tpu-vm create node-1 --project tpu-advanced-research --zone europe-west4-a --accelerator-type v3-32 --version v2-alpha-pod --network advanced --subnetwork advanced-subnet-for-europe-west4 ; do : ; done\n```\n\n其中 `node-1` 是你希望为 TPU VM 设置的名称，`--project` 是你的 Google Cloud 项目名称，`--network` 和 `--subnetwork` 是上一步中创建的网络和子网的名称。\n\n### 5.4. SSH 登录 TPU Pod\n\n由于 TPU Pod 由多个主机组成，我们需要选择其中一个主机作为 Host 0，然后通过 SSH 登录 Host 0 来执行命令。由于在 Google Cloud 网页上添加的 SSH 公钥会被自动分发到所有主机，因此每个主机都可以通过该 SSH 密钥直接连接，我们可以任意指定一个主机作为 Host 0。登录 Host 0 的方法与前述 TPU VM 相同。\n\n### 5.5. 修改 Host 0 上的 SSH 配置文件\n\nSSH 登录 Host 0 后，需要进行如下配置：\n\n```sh\nnano ~\u002F.ssh\u002Fconfig\n```\n\n添加以下内容：\n\n```\nHost 172.21.12.* 127.0.0.1\n    StrictHostKeyChecking no\n    UserKnownHostsFile \u002Fdev\u002Fnull\n    LogLevel ERROR\n```\n\n这里的 `172.21.12.*` 取决于上一步创建的子网 IP 地址范围。我们使用 `172.21.12.*` 是因为在创建子网时指定的 IP 地址范围为 172.21.12.0\u002F24。\n\n之所以这样做，是因为 SSH 的 `known_hosts` 文件用于防止中间人攻击（man-in-the-middle attacks）。而此处我们使用的是内网环境，无需防范此类攻击，也就不需要该文件，因此将其指向 `\u002Fdev\u002Fnull`。此外，保留 `known_hosts` 文件会导致首次连接时需要手动确认服务器指纹，这在内网环境中是不必要的，也不利于自动化操作。\n\n然后，运行以下命令修改该配置文件的权限。如果不修改权限，配置文件将不会生效：\n\n```sh\nchmod 600 ~\u002F.ssh\u002Fconfig\n```\n\n### 5.6. 将 Host 0 的 SSH 公钥添加到所有主机\n\n在 Host 0 上生成密钥对：\n\n```sh\nssh-keygen -t rsa -f ~\u002F.ssh\u002Fid_rsa -N \"\"\n```\n\n查看生成的 SSH 公钥：\n\n```sh\ncat ~\u002F.ssh\u002Fid_rsa.pub\n```\n\n将此公钥添加到 Google Cloud 的 SSH 密钥中。该密钥将自动分发到所有主机。\n\n### 5.7. 配置 `podrun` 命令\n\n`podrun` 是一个正在开发中的工具。当在 Host 0 上执行时，它可以通过 SSH 在所有主机上运行命令。\n\n下载 `podrun`：\n\n```sh\nwget https:\u002F\u002Fraw.githubusercontent.com\u002Fayaka14732\u002Fllama-2-jax\u002F18e9625f7316271e4c0ad9dea233cfe23c400c9b\u002Fpodrun\nchmod +x podrun\n```\n\n使用以下命令编辑 `~\u002Fpodips.txt`：\n\n```sh\nnano ~\u002Fpodips.txt\n```\n\n将其他主机的内网 IP 地址逐行保存到 `~\u002Fpodips.txt` 中。例如：\n\n```sh\n172.21.12.86\n172.21.12.87\n172.21.12.83\n```\n\n一个 TPU v3-32 包含 4 个主机。除去 Host 0，还有 3 个主机。因此，TPU v3-32 的 `~\u002Fpodips.txt` 应包含 3 个 IP 地址。\n\n使用系统自带的 pip3 安装 Fabric：\n\n```sh\npip3 install fabric\n```\n\n使用 `podrun` 让所有主机像小猫一样“喵喵”叫：\n\n```sh\n.\u002Fpodrun -iw -- echo meow\n```\n\n### 5.8. 配置 NFS\n\n安装 NFS 服务器和客户端：\n\n```sh\n.\u002Fpodrun -i -- sudo apt-get update -y -qq\n.\u002Fpodrun -i -- sudo apt-get upgrade -y -qq\n.\u002Fpodrun -- sudo apt-get install -y -qq nfs-common\nsudo apt-get install -y -qq nfs-kernel-server\nsudo mkdir -p \u002Fnfs_share\nsudo chown -R nobody:nogroup \u002Fnfs_share\nsudo chmod 777 \u002Fnfs_share\n```\n\n修改 `\u002Fetc\u002Fexports`：\n\n```sh\nsudo nano \u002Fetc\u002Fexports\n```\n\n添加：\n\n```\n\u002Fnfs_share  172.21.12.0\u002F24(rw,sync,no_subtree_check)\n```\n\n执行：\n\n```sh\nsudo exportfs -a\nsudo systemctl restart nfs-kernel-server\n\n.\u002Fpodrun -- sudo mkdir -p \u002Fnfs_share\n.\u002Fpodrun -- sudo mount 172.21.12.2:\u002Fnfs_share \u002Fnfs_share\n.\u002Fpodrun -i -- ln -sf \u002Fnfs_share ~\u002Fnfs_share\n\ntouch ~\u002Fnfs_share\u002Fmeow\n.\u002Fpodrun -i -- ls -la ~\u002Fnfs_share\u002Fmeow\n```\n\n请将 `172.21.12.2` 替换为 Host 0 实际的内网 IP 地址。\n\n### 5.9. 
在 TPU Pod 中设置开发环境\n\n将以下内容保存为 `~\u002Fnfs_share\u002Fsetup.sh`：\n\n```sh\n#!\u002Fbin\u002Fbash\n\nexport DEBIAN_FRONTEND=noninteractive\n\nsudo apt-get update -y -qq\nsudo apt-get upgrade -y -qq\nsudo apt-get install -y -qq golang neofetch zsh byobu\n\nsudo apt-get install -y -qq software-properties-common\nsudo add-apt-repository -y ppa:deadsnakes\u002Fppa\nsudo apt-get install -y -qq python3.12-full python3.12-dev\n\nsh -c \"$(curl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Fohmyzsh\u002Fohmyzsh\u002Fmaster\u002Ftools\u002Finstall.sh)\" \"\" --unattended\nsudo chsh $USER -s \u002Fusr\u002Fbin\u002Fzsh\n\npython3.12 -m venv ~\u002Fvenv\n\n. ~\u002Fvenv\u002Fbin\u002Factivate\n\npip install -U pip\npip install -U wheel\npip install -U \"jax[tpu]\" -f https:\u002F\u002Fstorage.googleapis.com\u002Fjax-releases\u002Flibtpu_releases.html\n```\n\n然后执行：\n\n```sh\nchmod +x ~\u002Fnfs_share\u002Fsetup.sh\n.\u002Fpodrun -i ~\u002Fnfs_share\u002Fsetup.sh\n```\n\n### 5.10. 验证 JAX 是否正常工作\n\n```sh\n.\u002Fpodrun -ic -- ~\u002Fvenv\u002Fbin\u002Fpython -c 'import jax; jax.distributed.initialize(); jax.process_index() == 0 and print(jax.devices())'\n```\n\n如果输出中包含 `TpuDevice`，则表示 JAX 已按预期工作。\n\n## 6. TPU 最佳实践\n\n### 6.1. 优先使用 Google Cloud Platform 而非 Google Colab\n\n[Google Colab](https:\u002F\u002Fcolab.research.google.com\u002F) 仅提供 TPU v2-8 设备，而在 [Google Cloud Platform](https:\u002F\u002Fcloud.google.com\u002Ftpu) 上你可以选择 TPU v2-8 和 TPU v3-8。\n\n此外，在 Google Colab 中你只能通过 Jupyter Notebook 界面使用 TPU。即使你[通过 SSH 登录到 Colab 服务器](https:\u002F\u002Fayaka.shn.hk\u002Fcolab\u002F)，那也是一个 Docker 镜像，并且你没有 root 权限。而在 Google Cloud Platform 上，你可以完全访问 TPU VM。\n\n如果你确实想在 Google Colab 上使用 TPU，则需要运行[以下脚本](01-basics\u002Fsetup_colab_tpu.py)来设置 TPU：\n\n```python\nimport jax\nfrom jax.tools.colab_tpu import setup_tpu\n\nsetup_tpu()\n\ndevices = jax.devices()\nprint(devices)  # 应该打印出 TpuDevice\n```\n\n### 6.2. 优先使用 TPU VM 而非 TPU node\n\n当你创建 TPU 实例时，需要在 TPU VM 和 TPU node 之间进行选择。始终优先选择 TPU VM，因为这是新架构，其中 TPU 设备直接连接到宿主机 VM，这将使 TPU 设备的设置更加简单。\n\n## 7. JAX 最佳实践\n\n### 7.1. 导入约定\n\n你可能会看到两种不同的导入约定。一种是将 `jax.numpy` 导入为 `np`，并将原始 NumPy 导入为 `onp`；另一种是将 `jax.numpy` 导入为 `jnp`，并将原始 NumPy 保留为 `np`。\n\n2019 年 1 月 16 日，Colin Raffel 在[一篇博客文章](https:\u002F\u002Fcolinraffel.com\u002Fblog\u002Fyou-don-t-know-jax.html)中提到，当时的约定是将原始 NumPy 导入为 `onp`。\n\n2020 年 11 月 5 日，Niru Maheswaranathan 在[一条推文](https:\u002F\u002Ftwitter.com\u002Fniru_m\u002Fstatus\u002F1324078070546882560)中表示，他认为当时的约定是将 `jax.numpy` 导入为 `jnp`，并将原始 NumPy 保留为 `np`。\n\n我们可以得出结论：新的约定是将 `jax.numpy` 导入为 `jnp`。\n\n### 7.2. 在 JAX 中管理随机密钥（random keys）\n\n常规做法如下：\n\n```python\nkey, *subkey = rand.split(key, num=4)\nprint(subkey[0])\nprint(subkey[1])\nprint(subkey[2])\n```\n\n### 7.3. NumPy 数组与 JAX 数组之间的转换\n\n使用 [`np.asarray`](https:\u002F\u002Fjax.readthedocs.io\u002Fen\u002Flatest\u002F_autosummary\u002Fjax.numpy.asarray.html) 和 [`onp.asarray`](https:\u002F\u002Fnumpy.org\u002Fdoc\u002Fstable\u002Freference\u002Fgenerated\u002Fnumpy.asarray.html)。\n\n```python\nimport jax.numpy as np\nimport numpy as onp\n\na = np.array([1, 2, 3])  # JAX 数组\nb = onp.asarray(a)  # 转换为 NumPy 数组\n\nc = onp.array([1, 2, 3])  # NumPy 数组\nd = np.asarray(c)  # 转换为 JAX 数组\n```\n\n### 7.4. 
PyTorch 张量与 JAX 数组之间的转换\n\n将 PyTorch 张量转换为 JAX 数组：\n\n```python\nimport jax.numpy as np\nimport torch\n\na = torch.rand(2, 2)  # PyTorch 张量\nb = np.asarray(a.numpy())  # JAX 数组\n```\n\n将 JAX 数组转换为 PyTorch 张量：\n\n```python\nimport jax.numpy as np\nimport numpy as onp\nimport torch\n\na = np.zeros((2, 2))  # JAX 数组\nb = torch.from_numpy(onp.asarray(a))  # PyTorch 张量\n```\n\n这将产生一个警告：\n\n```\nUserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ..\u002Ftorch\u002Fcsrc\u002Futils\u002Ftensor_numpy.cpp:178.)\n```\n\n如果你需要可写的张量，可以使用 `onp.array` 而不是 `onp.asarray`，以复制原始数组。\n\n### 7.5. 获取嵌套字典中所有参数的形状\n\n```python\njax.tree_map(lambda x: x.shape, params)\n```\n\n### 7.6. 在 CPU 上生成随机数的正确方式\n\n使用 [jax.default_device()](https:\u002F\u002Fjax.readthedocs.io\u002Fen\u002Flatest\u002F_autosummary\u002Fjax.default_device.html) 上下文管理器：\n\n```python\nimport jax\nimport jax.random as rand\n\ndevice_cpu = jax.devices('cpu')[0]\nwith jax.default_device(device_cpu):\n    key = rand.PRNGKey(42)\n    a = rand.poisson(key, 3, shape=(1000,))\n    print(a.device())  # TFRT_CPU_0\n```\n\n参见 \u003Chttps:\u002F\u002Fgithub.com\u002Fgoogle\u002Fjax\u002Fdiscussions\u002F9691#discussioncomment-3650311>。\n\n### 7.7. 使用 Optax 中的优化器\n\n
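原版 README 中此节只有标题。下面给出常见 Optax 用法的一个最小示例（其中的参数 pytree 与损失函数仅为演示用的占位内容，并非来自本项目）：\n\n```python\nimport jax\nimport jax.numpy as jnp\nimport optax\n\n# 演示用参数：任何由数组构成的 pytree 均可\nparams = {'w': jnp.ones((3, 3)), 'b': jnp.zeros((3,))}\n\noptimizer = optax.adamw(learning_rate=1e-3)\nopt_state = optimizer.init(params)\n\ndef loss_fn(params, x):\n    return jnp.sum(params['w'] @ x + params['b'])\n\nx = jnp.ones((3,))\ngrads = jax.grad(loss_fn)(params, x)\n\n# update 将原始梯度转换为参数更新；adamw 需要传入当前参数以应用权重衰减\nupdates, opt_state = optimizer.update(grads, opt_state, params)\nparams = optax.apply_updates(params, updates)\n```\n\n在实际训练循环中，通常会把求梯度与更新封装进一个用 `@jax.jit` 编译的函数。\n\n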
### 7.8. 使用 Optax 中的交叉熵损失实现\n\n对于整数标签的分类任务，请使用 `optax.softmax_cross_entropy_with_integer_labels`。\n\n## 8. 如何...\n\n### 8.1. 在多个 TPU VM 实例之间共享文件\n\n同一区域（zone）中的 TPU VM 实例通过内部 IP 相互连接，因此你可以[使用 NFS 创建共享文件系统](https:\u002F\u002Ftecadmin.net\u002Fhow-to-install-and-configure-an-nfs-server-on-ubuntu-20-04\u002F)。\n\n### 8.2. 监控 TPU 使用情况\n\n使用 [jax-smi](https:\u002F\u002Fgithub.com\u002Fayaka14732\u002Fjax-smi)。\n\n### 8.3. 在 TPU VM 上启动服务器\n\n示例：TensorBoard\n\n尽管每个 TPU VM 都分配了一个公网 IP，但在大多数情况下你不应将服务器直接暴露在互联网上，因为这不安全。应改为通过 SSH 进行端口转发：\n\n```sh\nssh -C -N -L 127.0.0.1:6006:127.0.0.1:6006 tpu1\n```\n\n### 8.4. 在不同的 TPU 核心上运行独立进程\n\n参见 \u003Chttps:\u002F\u002Fgist.github.com\u002Fskye\u002Ff82ba45d2445bb19d53545538754f9a3>。\n\n## 9. 常见陷阱\n\n### 9.1. TPU VM 会偶尔重启\n\n截至 2022 年 10 月 24 日，如果发生维护事件，TPU VM 会偶尔重启。\n\n以下情况会发生：\n\n1. 所有正在运行的进程将被终止  \n2. 外部 IP 地址将发生变化  \n\n我们可以定期保存模型参数、优化器状态以及其他有用的数据，以便在终止后轻松恢复模型训练。\n\n我们应该使用 `gcloud` 命令连接，而不是直接通过 SSH 连接。如果我们必须使用 SSH（例如，如果我们想使用 VSCode，SSH 是唯一的选择），则需要在重启后手动更改目标 IP 地址。\n\n### 9.2. 一个 TPU 核心（TPU core）同一时间只能被一个进程使用\n\n另见：§9.4。\n\n与 GPU 不同，如果你同时在 TPU 上运行两个进程，将会收到错误：\n\n```\nI0000 00:00:1648534265.148743  625905 tpu_initializer_helper.cc:94] libtpu.so already in use by another process. Run \"$ sudo lsof -w \u002Fdev\u002Faccel0\" to figure out which process is using the TPU. Not attempting to load libtpu.so in this process.\n```\n\n### 9.3. TCMalloc 会导致多个程序异常\n\n[TCMalloc](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Ftcmalloc) 是 Google 定制的内存分配库。在 TPU VM 上，默认通过设置 `LD_PRELOAD` 来使用 TCMalloc：\n\n```sh\n$ echo $LD_PRELOAD\n\u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibtcmalloc.so.4\n```\n\n然而，以这种方式使用 TCMalloc 可能会导致多个程序（如 gsutil）出错：\n\n```sh\n$ gsutil --help\n\u002Fsnap\u002Fgoogle-cloud-sdk\u002F232\u002Fplatform\u002Fbundledpythonunix\u002Fbin\u002Fpython3: \u002Fsnap\u002Fgoogle-cloud-sdk\u002F232\u002Fplatform\u002Fbundledpythonunix\u002Fbin\u002F..\u002F..\u002F..\u002Flib\u002Fx86_64-linux-gnu\u002Flibm.so.6: version `GLIBC_2.29' not found (required by \u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibtcmalloc.so.4)\n```\n\n[TCMalloc 的主页](http:\u002F\u002Fgoog-perftools.sourceforge.net\u002Fdoc\u002Ftcmalloc.html) 也指出，使用 `LD_PRELOAD` 的方式较为棘手，不推荐这样使用。\n\n如果你遇到与 TCMalloc 相关的问题，可以在当前 shell 中通过以下命令禁用它：\n\n```sh\nunset LD_PRELOAD\n```\n\n### 9.4. `libtpu.so` 已被另一个进程占用\n\n如果在没有其他自己任务运行的情况下仍遇到此错误，可用以下命令清理残留进程并删除 TPU 锁文件：\n\n```sh\nif ! pgrep -a -u $USER python ; then\n    killall -q -w -s SIGKILL ~\u002F.venv311\u002Fbin\u002Fpython\nfi\nrm -rf \u002Ftmp\u002Flibtpu_lockfile \u002Ftmp\u002Ftpu_logs\n```\n\n另见 \u003Chttps:\u002F\u002Fgithub.com\u002Fgoogle\u002Fjax\u002Fissues\u002F9220#issuecomment-1015940320>。\n\n### 9.5. JAX 不支持 multiprocessing 的 `fork` 启动策略（start method）\n\n请使用 `spawn` 或 `forkserver` 启动策略。\n\n参见 \u003Chttps:\u002F\u002Fgithub.com\u002Fgoogle\u002Fjax\u002Fissues\u002F1805#issuecomment-561244991>。","# tpu-starter 快速上手指南\n\n## 环境准备\n\n- **操作系统**：本地需使用 Linux 或 macOS（Windows 用户建议使用 WSL2）\n- **Google Cloud 账号**：需注册 [Google Cloud Platform](https:\u002F\u002Fcloud.google.com\u002F) 账号，并启用 TPU API\n- **项目设置**：创建一个 GCP 项目并记录项目 ID（如 `tpu-develop`）\n- **区域选择**：推荐使用 `europe-west4-a` 或 `us-central1-b`（TRC 项目常用区域）\n- **SSH 密钥**：本地需生成 SSH 公私钥对（若尚未生成）：\n\n```sh\nssh-keygen -t rsa -f ~\u002F.ssh\u002Fid_rsa -N \"\"\n```\n\n> 💡 国内用户建议配置 gcloud CLI 使用代理或通过 Cloud Shell 操作，避免网络不稳定。\n\n## 安装步骤\n\n### 1. 创建 TPU VM 实例\n\n在 [Google Cloud Console](https:\u002F\u002Fconsole.cloud.google.com\u002Fcompute\u002Ftpus) 打开 Cloud Shell，执行：\n\n```sh\nuntil gcloud alpha compute tpus tpu-vm create node-1 \\\n  --project YOUR_PROJECT_ID \\\n  --zone europe-west4-a \\\n  --accelerator-type v3-8 \\\n  --version tpu-vm-base; do :; done\n```\n\n> 将 `YOUR_PROJECT_ID` 替换为你的实际项目 ID。命令会自动重试直至创建成功。\n\n### 2. 配置 SSH 公钥\n\n1. 复制本地公钥内容：\n   ```sh\n   cat ~\u002F.ssh\u002Fid_rsa.pub\n   ```\n2. 在 GCP 控制台搜索 “SSH keys”，进入 **元数据 > SSH 密钥** 页面\n3. 点击“编辑”，粘贴公钥（注意：公钥末尾的用户名将作为 TPU VM 上的登录用户名）\n\n### 3. 配置本地 SSH 别名（可选但推荐）\n\n编辑 `~\u002F.ssh\u002Fconfig`：\n\n```sh\nHost tpuv3-8-1\n    User YOUR_USERNAME\n    Hostname TPU_VM_IP_ADDRESS\n```\n\n> `YOUR_USERNAME` 是公钥中 `@` 前的部分（如 `ayaka`），`TPU_VM_IP_ADDRESS` 可在 TPU 实例详情页查看。\n\n### 4. 登录 TPU VM\n\n```sh\nssh tpuv3-8-1\n```\n\n### 5. 验证 TPU 设备\n\n登录后执行：\n\n```sh\nls \u002Fdev\u002Faccel*\n```\n\n若输出类似 `\u002Fdev\u002Faccel0 \u002Fdev\u002Faccel1 ...`，说明 TPU 已就绪。\n\n### 6. 
# tpu-starter Quickstart Guide

## Prerequisites

- **Operating system**: Linux or macOS locally (Windows users should use WSL2)
- **Google Cloud account**: register a [Google Cloud Platform](https://cloud.google.com/) account and enable the TPU API
- **Project setup**: create a GCP project and note its project ID (e.g. `tpu-develop`)
- **Zone selection**: `europe-west4-a` or `us-central1-b` are recommended (zones commonly used by the TRC program)
- **SSH key**: generate an SSH key pair locally (if you do not already have one):

```sh
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""
```

> 💡 Users in mainland China should configure the gcloud CLI to use a proxy, or work from Cloud Shell, to avoid an unstable connection.

## Installation

### 1. Create a TPU VM instance

Open Cloud Shell from the [Google Cloud Console](https://console.cloud.google.com/compute/tpus) and run:

```sh
until gcloud alpha compute tpus tpu-vm create node-1 \
  --project YOUR_PROJECT_ID \
  --zone europe-west4-a \
  --accelerator-type v3-8 \
  --version tpu-vm-base; do :; done
```

> Replace `YOUR_PROJECT_ID` with your actual project ID. The command retries automatically until creation succeeds.

### 2. Register your SSH public key

1. Copy your local public key:
   ```sh
   cat ~/.ssh/id_rsa.pub
   ```
2. In the GCP console, search for "SSH keys" and open the **Metadata > SSH keys** page
3. Click "Edit" and paste the public key (note: the username at the end of the key becomes your login username on the TPU VM)

### 3. Configure a local SSH alias (optional but recommended)

Edit `~/.ssh/config`:

```sh
Host tpuv3-8-1
    User YOUR_USERNAME
    Hostname TPU_VM_IP_ADDRESS
```

> `YOUR_USERNAME` is the part of the public key before the `@` (e.g. `ayaka`); `TPU_VM_IP_ADDRESS` is shown on the TPU instance's details page.

### 4. Log in to the TPU VM

```sh
ssh tpuv3-8-1
```

### 5. Verify the TPU devices

After logging in, run:

```sh
ls /dev/accel*
```

Output like `/dev/accel0 /dev/accel1 ...` means the TPU is ready.

### 6. Set up the development environment (run inside the TPU VM)

```sh
sudo apt-get update -y -qq
sudo apt-get install -y -qq software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt-get install -y -qq python3.12 python3.12-venv python3.12-dev

python3.12 -m venv env
source env/bin/activate
pip install --upgrade pip
pip install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
```

> ⚠️ Users in mainland China can switch to a pip mirror to speed up installation (e.g. `-i https://pypi.tuna.tsinghua.edu.cn/simple`), but JAX's TPU backend must be installed from the official link above.

## Basic usage

Verify that JAX correctly detects the TPU:

```python
import jax
print("TPU devices:", jax.devices())
print("Device kind:", jax.devices()[0].device_kind)

# A simple computation as a smoke test
import jax.numpy as jnp
x = jnp.ones((1024, 1024))
y = jnp.dot(x, x)
print("Computation done on TPU!")
```

Save this as `test_tpu.py` and run it:

```sh
python test_tpu.py
```

If the output mentions `TPU v3` or `TPU v4` and there are no errors, the environment is set up correctly.

# Use case

A university research team is reproducing a JAX-based large-scale Transformer paper and needs to train on Google Cloud TPUs, but the team has only ever used GPUs and is completely new to the TPU architecture and its setup process.

### Without tpu-starter

- The team spends days combing through scattered official documentation, still unsure how to apply for free TRC (TPU Research Cloud) resources; several applications are rejected.
- After creating a TPU VM, they cannot tell whether the TPU devices are correctly detected, and repeated dependency debugging never gets the JAX example code running.
- Setting up VSCode Remote-SSH and Jupyter Notebook runs into permission and networking problems; development velocity is very low.
- When they attempt distributed training on a TPU Pod, unfamiliarity with NFS and SSH key synchronization keeps the cluster from ever cooperating.
- Without JAX best-practice guidance, they trip repeatedly over details such as random number generation and dtype conversion, producing unstable training results.

### With tpu-starter

- Following the project's clear "how to apply for TRC" steps, they obtain a free TPU quota on the first attempt.
- Working through the "verify the TPU is available" and "set up the development environment" sections, they run a basic JAX example within 10 minutes.
- Using the built-in VSCode Remote-SSH and Jupyter configuration guides, they quickly recreate a familiar local development experience.
- The TPU Pod deployment manual walks them step by step through subnet, NFS, and podrun configuration, and multi-chip parallel training launches successfully.
- Following JAX best practices (random key management, Optax optimizer usage, and so on) markedly improves code robustness and training stability.

tpu-starter turns the TPU from high-barrier specialized hardware into an out-of-the-box research tool, sharply reducing the infrastructure learning cost for AI researchers.

# Environment

- OS: Linux (TPU VM)
- RAM: 335 GiB (TPU VM instance configuration)
- Python: 3.12
- Key dependencies: JAX, Optax, Flax, Hugging Face Transformers, Keras
- Notes: the project targets Google Cloud TPU, so TPU VM or TPU Pod instances must be created through Google Cloud Platform; it does not run on local GPUs. JAX is the recommended deep-learning framework (PyTorch's TPU support is poor). Configure SSH keys and manage TPU instances with the gcloud CLI, and build the development environment inside the TPU VM.
- License: CC-BY-4.0

# FAQ

**Q: Can I still use DataParallel together with Hugging Face Accelerate?**

A: No. Accelerate is a library that simplifies distributed training and already wraps the multi-GPU/TPU parallelism logic (e.g. DistributedDataParallel), so you should not additionally use PyTorch's DataParallel by hand. Mixing the two causes misbehavior such as duplicated logs, degraded training efficiency, and multiple saved copies of the model weights. (Source: <https://github.com/ayaka14732/tpu-starter/issues/1>)

**Q: With multi-process training under Accelerate, why are logs duplicated, training no faster, and the loss worse than with DataParallel?**

A: By default, Accelerate runs the complete script on every process (one per GPU/TPU core), including log printing and model saving. Make sure logging and checkpointing happen only on the main process (e.g. guarded by `accelerator.is_main_process`). Also, Accelerate uses the more efficient DistributedDataParallel rather than DataParallel, so the early loss curve may differ, but long-run training results are usually better. Organize your training loop as in the official examples. (Source: <https://github.com/ayaka14732/tpu-starter/issues/1>)
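A minimal sketch of the main-process guard described in the answer above, assuming the standard Hugging Face Accelerate APIs (`Accelerator`, `prepare`, `backward`, `is_main_process`, `unwrap_model`); the toy model, data, and file name are illustrative.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Toy model and data, purely illustrative.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
loss_fn = torch.nn.CrossEntropyLoss()

for step, (x, y) in enumerate(loader):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    accelerator.backward(loss)  # use this instead of loss.backward()
    optimizer.step()
    # Guard side effects: without this check, every process prints
    # its own copy of each log line.
    if accelerator.is_main_process:
        print(f"step {step}: loss {loss.item():.4f}")

# Save once, from the main process only, after unwrapping the DDP shell.
if accelerator.is_main_process:
    accelerator.save(accelerator.unwrap_model(model).state_dict(), "model.pt")
```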
**Q: The v2-nightly TPU software version is no longer available. What should I do?**

A: Google has deprecated the v2-nightly TPU software version; use tpu-vm-base instead. When creating a TPU VM, simply specify tpu-vm-base as the software version. The project maintainer updated the configuration to this version in commit d4557b0 to keep it compatible. (Source: <https://github.com/ayaka14732/tpu-starter/issues/11>)

**Q: Which TPU software version is currently recommended for launching a VM?**

A: tpu-vm-base is currently the recommended software version for TPU VMs, since old nightly images such as v2-nightly are no longer provided. Using tpu-vm-base avoids deployment failures and runtime errors caused by missing versions. (Source: <https://github.com/ayaka14732/tpu-starter/issues/11>)

**Q: The TPU Discord link in the project documentation is dead. What now?**

A: The link is indeed dead, and the maintainer has acknowledged the issue. Seek support through other community channels (such as the Hugging Face forums, GitHub Discussions, or the official TPU documentation), or watch the project README for an updated link. (Source: <https://github.com/ayaka14732/tpu-starter/issues/16>)

# Releases

- v2.0 (2023-10-08)
- v1.0 (2022-10-08)