[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-google-research--t5x":3,"tool-google-research--t5x":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":68,"owner_location":68,"owner_email":68,"owner_twitter":68,"owner_website":79,"owner_url":80,"languages":81,"stars":98,"forks":99,"last_commit_at":100,"license":101,"difficulty_score":102,"env_os":103,"env_gpu":104,"env_ram":105,"env_deps":106,"category_tags":117,"github_topics":68,"view_count":23,"oss_zip_url":68,"oss_zip_packed_at":68,"status":16,"created_at":118,"updated_at":119,"faqs":120,"releases":149},3770,"google-research\u002Ft5x","t5x",null,"t5x 是一个专为序列模型（尤其是语言模型）打造的高性能研究框架，支持从训练、评估到推理的全流程操作。它本质上是经典 T5 代码库的现代化升级版，基于 JAX 和 Flax 重构，旨在解决大规模模型训练中配置复杂、扩展性差及硬件利用率低等痛点。\n\n该工具采用模块化与可组合的设计架构，让研究人员能够灵活定制实验流程，轻松在单节点或多节点环境下部署任务。无论是想在 Google Cloud 上利用 TPU 进行快速原型验证，还是希望在 SLURM 集群或 NVIDIA H100 GPU 上追求极致算力，t5x 都提供了详尽的支持与优化脚本，显著降低了分布式训练的管理门槛。\n\nt5x 特别适合人工智能研究人员、算法工程师以及对大模型技术有深入探索需求的开发者使用。其独特的技术亮点在于完美融合了 JAX 函数式编程的高效性与 Mesh TensorFlow 的并行策略，不仅实现了代码的简洁优雅，更确保了在超大规模参数下的训练稳定性与速度。对于希望复现前沿成果或开展自定义模型研究的团队而言，这是一个友好且强大的得力助手。","# T5X\n\n*Go to [T5X ReadTheDocs Documentation Page](https:\u002F\u002Ft5x.readthedocs.io\u002F).*\n\nT5X is a modular, composable, research-friendly framework for high-performance,\nconfigurable, self-service training, evaluation, and inference of sequence\nmodels (starting with language) 
at many scales.\n\nIt is essentially a new and improved implementation of the\n[T5 codebase](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftext-to-text-transfer-transformer)\n(based on [Mesh TensorFlow](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Fmesh)) in [JAX](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fjax) and [Flax](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fflax). To learn\nmore, see the [T5X Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.17189).\n\nBelow is a quick start guide for training models with TPUs on Google Cloud. For\nadditional tutorials and background, see the [complete documentation](docs\u002Findex.md).\n\n## Quickstart (Recommended)\n\nT5X can be run with [XManager](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fxmanager) on\n[Vertex AI](https:\u002F\u002Fcloud.google.com\u002Fvertex-ai). Vertex AI is a platform for\ntraining that creates TPU instances and runs code on the TPUs. Vertex AI will\nalso shut down the TPUs when the jobs terminate. This is significantly easier\nthan managing GCE VMs and TPU VM instances.\n\n1. Follow the prerequisites and directions to install [XManager](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fxmanager).\n\n2. Request TPU quota as required. GCP projects come with 8 cores by default,\nwhich is enough to run one training experiment on a single TPU host. If you want\nto run multi-host training or run multiple trials in parallel, you will need\nmore quota. 
Navigate to [Quotas](https:\u002F\u002Fconsole.cloud.google.com\u002Fquotas).\n\n  The quota you want is:\n\n  * Service: `Vertex AI API`\n  * Dimensions (location): `us-central1`\n  * If you want to run single-host experiments:\n    * `Custom model training TPU V2 cores per region`\n    * `Custom model training TPU V3 cores per region`\n  * If you want to run multi-host experiments:\n    * `Custom model training TPU V2 pod cores per region`\n    * `Custom model training TPU V3 pod cores per region`\n\n  TIP: You won't be able to run single-host experiments with multi-host quota.\n  (i.e. you can't run `tpu_v2=8` using `TPU V2 pod`)\n\n\n3. Launch the xmanager script located at `t5x\u002Fscripts\u002Fxm_launch.py`.\n\nAs a running example, we use the WMT14 En-De translation which is described in\nmore detail in the Examples section below.\n\n```sh\nexport GOOGLE_CLOUD_BUCKET_NAME=...\nexport TFDS_DATA_DIR=gs:\u002F\u002F$GOOGLE_CLOUD_BUCKET_NAME\u002Ft5x\u002Fdata\nexport MODEL_DIR=gs:\u002F\u002F$GOOGLE_CLOUD_BUCKET_NAME\u002Ft5x\u002F$(date +%Y%m%d)\n\n# Pre-download dataset in multi-host experiments.\ntfds build wmt_t2t_translate --data_dir=$TFDS_DATA_DIR\n\ngit clone https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\ncd .\u002Ft5x\u002F\n\npython3 .\u002Ft5x\u002Fscripts\u002Fxm_launch.py \\\n  --gin_file=t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_from_scratch.gin \\\n  --model_dir=$MODEL_DIR \\\n  --tfds_data_dir=$TFDS_DATA_DIR\n```\n\nCheck `gs:\u002F\u002F$GOOGLE_CLOUD_BUCKET_NAME\u002Ft5x\u002F` for the output artifacts, which can\nbe read by TensorBoard.\n\n## GPU Usage\nNote: NVIDIA has released an updated version of this repository with H100 FP8 support and broad GPU performance improvements. 
Please visit the [NVIDIA Rosetta](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FJAX-Toolbox\u002Ftree\u002Fmain\u002Frosetta\u002Frosetta\u002Fprojects\u002Ft5x) repository for more details and usage instructions.\n\nT5X can be run easily on GPUs either in single-node configurations or multi-node configurations with a SLURM+pyxis cluster. Further instructions at [t5x\u002Fcontrib\u002Fgpu](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\u002Fblob\u002Fmain\u002Ft5x\u002Fcontrib\u002Fgpu\u002FREADME.md). The `t5x\u002Fcontrib\u002Fgpu\u002Fscripts_gpu` folder contains example scripts for pretraining T5X on [The Pile](https:\u002F\u002Fpile.eleuther.ai\u002F) and for finetuning on SQuAD and MNLI. These scripts and associated `gin` configurations also contain additional GPU optimizations for better throughput. More examples and instructions can be found in the [NVIDIA Rosetta](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FJAX-Toolbox\u002Ftree\u002Fmain\u002Frosetta\u002Frosetta\u002Fprojects\u002Ft5x) repository maintained by NVIDIA with H100 FP8 support and broad GPU performance improvements.\n\n\n## Installation\n\nNote that all the commands in this document should be run in the commandline of\nthe TPU VM instance unless otherwise stated.\n\n1.  Follow the\n    [instructions](https:\u002F\u002Fcloud.google.com\u002Ftpu\u002Fdocs\u002Fjax-quickstart-tpu-vm#install_the_google_cloud_sdk)\n    to set up a Google Cloud Platform (GCP) account and enable the Cloud TPU\n    API.\n\n    **Note:** T5X also works with GPU, please follow instructions in [t5x\u002Fcontrib\u002Fgpu](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\u002Fblob\u002Fmain\u002Ft5x\u002Fcontrib\u002Fgpu\u002FREADME.md) if you'd like to use GPU version.\n\n2.  
Create a\n    [Cloud TPU VM instance](https:\u002F\u002Fcloud.google.com\u002Fblog\u002Fproducts\u002Fcompute\u002Fintroducing-cloud-tpu-vms)\n    following\n    [these instructions](https:\u002F\u002Fcloud.google.com\u002Ftpu\u002Fdocs\u002Fjax-quickstart-tpu-vm#create-vm).\n    We recommend that you develop your workflow on a single v3-8 TPU (i.e.,\n    `--accelerator-type=v3-8`) and scale up to pod slices once the pipeline is\n    ready. In this README, we focus on using a single v3-8 TPU. See\n    [here](https:\u002F\u002Fcloud.google.com\u002Ftpu\u002Fdocs\u002Fsystem-architecture-tpu-vm) to\n    learn more about TPU architectures.\n\n3.  With Cloud TPU VMs, you ssh directly into the host machine of the TPU VM.\n    You can install packages, run your code, etc. on the host machine. Once\n    the TPU instance is created, ssh into it with\n\n    ```sh\n    gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} --zone=${ZONE}\n    ```\n\n    where `TPU_NAME` and `ZONE` are the name and the zone used in step 2.\n\n4.  Install T5X and the dependencies.\n\n    ```sh\n    git clone --branch=main https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\n    cd t5x\n\n    python3 -m pip install -e '.[tpu]' -f \\\n      https:\u002F\u002Fstorage.googleapis.com\u002Fjax-releases\u002Flibtpu_releases.html\n\n    ```\n\n\n5.  Create a Google Cloud Storage (GCS) bucket to store the dataset and model\n    checkpoints. To create a GCS bucket, see these\n    [instructions](https:\u002F\u002Fcloud.google.com\u002Fstorage\u002Fdocs\u002Fcreating-buckets).\n\n6.  (optional) If you prefer working with a Jupyter\u002FColab-style environment,\n    you can set up a custom Colab runtime by following the steps from\n    [t5x\u002Fnotebooks](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\u002Fblob\u002Fmain\u002Ft5x\u002Fnotebooks\u002FREADME.md).\n\n## Example: English to German translation\n\nAs a running example, we use the WMT14 En-De translation. 
The raw dataset is\navailable in TensorFlow Datasets as\n[\"wmt_t2t_translate\"](https:\u002F\u002Fwww.tensorflow.org\u002Fdatasets\u002Fcatalog\u002Fwmt_t2t_translate).\n\nT5 casts the translation task such as the following\n\n```py\n{'en': 'That is good.', 'de': 'Das ist gut.'}\n```\n\nto the form called \"text-to-text\":\n\n```py\n{'inputs': 'translate English to German: That is good.', 'targets': 'Das ist gut.'}\n```\n\nThis formulation allows many different classes of language tasks to be expressed\nin a uniform manner and a single encoder-decoder architecture can handle them\nwithout any task-specific parameters. For more detail, refer to the [T5 paper\n(Raffel et al. 2019)][t5_paper].\n\nFor a scalable data pipeline and an evaluation framework, we use\n[`SeqIO`](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fseqio), which was factored out of the [T5\nlibrary][t5_github]. A `seqio.Task` packages together the raw dataset, vocabulary,\npreprocessing such as tokenization and evaluation metrics such as\n[BLEU](https:\u002F\u002Faclanthology.org\u002FP02-1040.pdf) and provides a\n[`tf.data`](https:\u002F\u002Fwww.tensorflow.org\u002Fguide\u002Fdata) instance.\n\n[The T5 library][t5_github] provides a number of `seqio.Task`s that were used in the\n[T5 paper][t5_paper]. 
In this example, we use [wmt_t2t_ende_v003](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftext-to-text-transfer-transformer\u002Fblob\u002Fd81c0bab2a41b4d5dfbe4971de32f7d67df65f31\u002Ft5\u002Fdata\u002Ftasks.py#L212).\n\nBefore training or fine-tuning, you need to download the\n[\"wmt_t2t_translate\"](https:\u002F\u002Fwww.tensorflow.org\u002Fdatasets\u002Fcatalog\u002Fwmt_t2t_translate)\ndataset first.\n\n```sh\n# Data dir to save the processed dataset in \"gs:\u002F\u002Fdata_dir\" format.\nTFDS_DATA_DIR=\"...\"\n\n# Make sure that the dataset package is up-to-date.\npython3 -m pip install --upgrade tfds-nightly\n\n# Pre-download dataset.\ntfds build wmt_t2t_translate --data_dir=${TFDS_DATA_DIR}\n```\n\n### Training\n\nTo run a training job, we use the `t5x\u002Ftrain.py` script.\n\n```sh\n# Model dir to save logs, ckpts, etc. in \"gs:\u002F\u002Fmodel_dir\" format.\nMODEL_DIR=\"...\"\nT5X_DIR=\"...\"  # directory where the T5X repo is cloned.\nTFDS_DATA_DIR=\"...\"\n\npython3 ${T5X_DIR}\u002Ft5x\u002Ftrain.py \\\n  --gin_file=\"t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_from_scratch.gin\" \\\n  --gin.MODEL_DIR=\\\"${MODEL_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\nThe configuration for this training run is defined in the Gin file\n[base_wmt_from_scratch.gin](t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_from_scratch.gin).\n[Gin-config](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fgin-config) is a library to handle\nconfigurations based on dependency injection. Among many benefits, Gin allows\nusers to pass custom components such as a custom model to the T5X library\nwithout having to modify the core library. The [custom\ncomponents](#custom-components) section shows how this is done.\n\nWhile the core library is independent of Gin, Gin is central to the examples we\nprovide. Therefore, we provide a short [introduction][gin-primer] to Gin in the\ncontext of T5X.  
All the configurations are written to a file \"config.gin\" in\n`MODEL_DIR`. This makes debugging as well as reproducing the experiment much\neasier.\n\nIn addition to `config.gin`, the `model-info.txt` file summarizes the model\nparameters (shape, names of the axes, partitioning info) as well as the\noptimizer states.\n\n\n\n#### TensorBoard\n\nTo monitor the training in [TensorBoard](https:\u002F\u002Fwww.tensorflow.org\u002Ftensorboard), it is much easier (due to\nauthentication issues) to launch the TensorBoard on your own machine and _not_ in\nthe TPU VM. So in a commandline on your own machine (not the one where you\nssh'ed into the TPU VM), launch the TensorBoard with the `logdir` pointing to\nthe `MODEL_DIR`.\n\n```sh\n# NB: run this on your machine not TPU VM!\nMODEL_DIR=\"...\"  # Copy from the TPU VM.\ntensorboard --logdir=${MODEL_DIR}\n```\n\nOr you can launch the TensorBoard inside a Colab. In a Colab cell, run\n\n```python\nfrom google.colab import auth\nauth.authenticate_user()\n```\n\nto authorize the Colab to access the GCS bucket and launch the TensorBoard.\n\n```python\n%load_ext tensorboard\nmodel_dir = \"...\"  # Copy from the TPU VM.\n%tensorboard --logdir=$model_dir\n```\n\n\n### Fine-tuning\n\nWe can leverage the benefits of self-supervised pre-training by initializing\nfrom one of our pre-trained models. Here we use the T5.1.1 Base checkpoint.\n\n```sh\n# Model dir to save logs, ckpts, etc. in \"gs:\u002F\u002Fmodel_dir\" format.\nMODEL_DIR=\"...\"\n\n# Data dir to save the processed dataset in \"gs:\u002F\u002Fdata_dir\" format.\nTFDS_DATA_DIR=\"...\"\nT5X_DIR=\"...\"  # directory where the T5X repo is cloned.\n\npython3 ${T5X_DIR}\u002Ft5x\u002Ftrain.py \\\n  --gin_file=\"t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_finetune.gin\" \\\n  --gin.MODEL_DIR=\\\"${MODEL_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n**Note:** when supplying a string, dict, list, tuple value, or a bash variable\nvia a flag, you must put it in quotes. 
In the case of strings, it requires\nescaped quotes (`\\\"\u003Cstring>\\\"`). For example:\n`--gin.utils.DatasetConfig.split=\\\"validation\\\"` or\n`--gin.MODEL_DIR=\\\"${MODEL_DIR}\\\"`.\n\nGin makes it easy to change a number of configurations. For example, you can\nchange `partitioning.PjitPartitioner.num_partitions` (overriding\nthe value in\n[base_wmt_from_scratch.gin](t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_from_scratch.gin))\nto change the parallelism strategy and pass it as a commandline arg.\n\n```sh\n--gin.partitioning.PjitPartitioner.num_partitions=8\n```\n\n### Evaluation\n\nTo run the offline (i.e. without training) evaluation, you can use the `t5x\u002Feval.py`\nscript.\n\n```sh\nEVAL_OUTPUT_DIR=\"...\"  # directory to write eval output\nT5X_DIR=\"...\"  # directory where the t5x is cloned, e.g., ${HOME}\"\u002Ft5x\".\nTFDS_DATA_DIR=\"...\"\nCHECKPOINT_PATH=\"...\"\n\npython3 ${T5X_DIR}\u002Ft5x\u002Feval.py \\\n  --gin_file=\"t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_eval.gin\" \\\n  --gin.CHECKPOINT_PATH=\\\"${CHECKPOINT_PATH}\\\" \\\n  --gin.EVAL_OUTPUT_DIR=\\\"${EVAL_OUTPUT_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n\n### Inference\n\nTo run inference, you can use the `t5x\u002Finfer.py` script. 
Here we use the same\n`seqio.Task`, but for inference we do not use the target features other than\nlogging them alongside the prediction in a JSON file.\n\n```sh\nINFER_OUTPUT_DIR=\"...\"  # directory to write infer output\nT5X_DIR=\"...\"  # directory where the t5x is cloned, e.g., ${HOME}\"\u002Ft5x\".\nTFDS_DATA_DIR=\"...\"\nCHECKPOINT_PATH=\"...\"\n\npython3 ${T5X_DIR}\u002Ft5x\u002Finfer.py \\\n  --gin_file=\"t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_infer.gin\" \\\n  --gin.CHECKPOINT_PATH=\\\"${CHECKPOINT_PATH}\\\" \\\n  --gin.INFER_OUTPUT_DIR=\\\"${INFER_OUTPUT_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n### Exporting as TensorFlow Saved Model\n\nA pretrained model can be exported as a TensorFlow Saved Model and deployed\nto the Vertex AI Prediction service using the\n[Optimized TensorFlow Runtime](https:\u002F\u002Fcloud.google.com\u002Fvertex-ai\u002Fdocs\u002Fpredictions\u002Foptimized-tensorflow-runtime).\nPlease note that the exported model won't work on the OSS-based\n[TensorFlow Model Server](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Fserving).\n\n```sh\nT5X_DIR=\"...\"  # directory where the t5x is cloned, e.g., ${HOME}\"\u002Ft5x\".\nCHECKPOINT_PATH=\"...\"\n\nBATCH_SIZE=None\nBEAM_SIZE=1\n\n# Use 'bfloat16' if you plan to run exported model on NVIDIA A100 or newer GPUs,\n# for other GPUs use 'float32'.\nACTIVATION_DTYPE=bfloat16\n\n# Version numbers must be numeric. We generate one based on datetime.\nVERSION=$(date +%Y%m%d%H%M%S)\n\nNAME=t5x_base_${ACTIVATION_DTYPE}  # Model name.\n\n# Path to export model to. 
Note that the export script is going to add a _cpu suffix\n# after the model name.\nOUTPUT=${CHECKPOINT_PATH}\u002Fsaved_model.${NAME}\u002F${VERSION}\n\ndeclare -a ARGS=(\n--gin_file=t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fbase.gin\n--gin_file=t5x\u002Ft5x\u002Fconfigs\u002Fruns\u002Fexport.gin\n--gin.TASK_FEATURE_LENGTHS=\"{'inputs': 256, 'targets': 256}\"\n--gin.CHECKPOINT_PATH=\\\"${CHECKPOINT_PATH}\\\"\n--gin.MODEL_NAME=\\\"\u002Fml\u002F${USER}\u002Ft5x_base\\\"\n--gin.MODEL_OUTPUT_DIR=\\\"${OUTPUT}\\\"\n--gin.BEAM_SIZE=${BEAM_SIZE}\n--gin.BATCH_SIZE=${BATCH_SIZE}\n--gin.export_lib.save.partitioner=None\n--gin.export_lib.save.warmup_examples=\"['hello world']\"\n--gin.export_lib.ExportableModule.use_batch_function=False\n--gin.export_lib.ExportableModule.use_gpu=False\n--gin.export_lib.ExportableModule.jit_compile=False\n--gin.ACTIVATION_DTYPE=\\\"${ACTIVATION_DTYPE}\\\"\n--gin.network.T5Config.dtype=\\\"${ACTIVATION_DTYPE}\\\"\n--gin.utils.RestoreCheckpointConfig.dtype=\\\"${ACTIVATION_DTYPE}\\\"\n--gin.DROPOUT_RATE=0.0\n)\n\n(python3 ${T5X_DIR}\u002Ft5x\u002Fexport.py \"${ARGS[@]}\")\n```\n\nFor detailed argument definitions, refer to\n[export.gin](t5x\u002Fconfigs\u002Fruns\u002Fexport.gin).\n\nYou can run XL and smaller models on NVIDIA A100 40GB, and XXL models on\nNVIDIA A100 80GB.\n\n## Custom components\n\n[The translation example](#example-english-to-german-translation) uses the\nencoder-decoder model that T5X provides as well as the dataset from the T5\nlibrary. This section shows how you can use your own dataset and model and\npass them via Gin.\n\n### Example: custom dataset in a user directory\n\nFor this example, we have the following directory structure with\n`${HOME}\u002Fdir1\u002Fuser_dir` representing a user directory with custom components.\n\n```\n${HOME}\n└── dir1\n    └── user_dir\n        ├── t5_1_1_base_de_en.gin\n        └── tasks.py\n```\n\nAs an example, let's define a new dataset. 
Here we use the same translation\ndataset, but we define the translation task in the opposite direction, i.e.,\nGerman to English instead of English to German. We define this task in `tasks.py`:\n\n```py\n# ${HOME}\u002Fdir1\u002Fuser_dir\u002Ftasks.py\n\nimport functools\nimport seqio\nimport tensorflow_datasets as tfds\nfrom t5.evaluation import metrics\nfrom t5.data import preprocessors\n\nvocabulary = seqio.SentencePieceVocabulary(\n    'gs:\u002F\u002Ft5-data\u002Fvocabs\u002Fcc_all.32000\u002Fsentencepiece.model', extra_ids=100)\noutput_features = {\n    'inputs': seqio.Feature(vocabulary=vocabulary),\n    'targets': seqio.Feature(vocabulary=vocabulary)\n}\n\nseqio.TaskRegistry.add(\n    'wmt_t2t_de_en_v003',\n    source=seqio.TfdsDataSource(tfds_name='wmt_t2t_translate\u002Fde-en:1.0.0'),\n    preprocessors=[\n        functools.partial(\n            preprocessors.translate,\n            source_language='de', target_language='en'),\n        seqio.preprocessors.tokenize,\n        seqio.CacheDatasetPlaceholder(),\n        seqio.preprocessors.append_eos_after_trim,\n    ],\n    metric_fns=[metrics.bleu],\n    output_features=output_features)\n```\n\nIn the Gin file, most of the settings are equivalent to those used in the\n[En->De example](#example-english-to-german-translation). So we include the Gin\nfile from that example. To use the \"wmt_t2t_de_en_v003\" task we just defined, we\nneed to import the task module \"tasks.py\". Note that we use a relative path\ndefined with respect to the user directory. 
This will be specified as a\nflag.\n\n```py\n# ${HOME}\u002Fdir1\u002Fuser_dir\u002Ft5_1_1_base_de_en.gin\nfrom __gin__ import dynamic_registration\nimport tasks  # This imports the task defined in dir1\u002Fuser_dir\u002Ftasks.py.\n\ninclude \"t5x-tmp\u002Ft5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_from_scratch.gin\"\nMIXTURE_OR_TASK_NAME = \"wmt_t2t_de_en_v003\"\n```\n\nFinally, we launch training, passing the user directory via the\n`gin_search_paths` flag so that the Gin file and Python modules can be specified\nwith relative paths.\n\n```sh\nPROJECT_DIR=${HOME}\"\u002Fdir1\u002Fuser_dir\"\nT5X_DIR=\"...\"  # directory where the t5x is cloned.\nTFDS_DATA_DIR=\"...\"\nMODEL_DIR=\"...\"\nexport PYTHONPATH=${PROJECT_DIR}\n\npython3 ${T5X_DIR}\u002Ft5x\u002Ftrain.py \\\n  --gin_search_paths=${PROJECT_DIR} \\\n  --gin_file=\"t5_1_1_base_de_en.gin\" \\\n  --gin.MODEL_DIR=\\\"${MODEL_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n## Checkpoints\n\n### Native Checkpoints\n\nWe have released the checkpoints of many of the original T5 models and their\nvariants in a native T5X format for maximal efficiency.\nSee the [complete list](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\u002Fblob\u002Fmain\u002Fdocs\u002Fmodels.md) including the\nmatching Gin configuration files.\n\nThese are converted from the public [Mesh TensorFlow\ncheckpoints](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftext-to-text-transfer-transformer\u002Fblob\u002Fmain\u002Freleased_checkpoints.md#t511).\n\n\n### Compatibility with the Mesh TensorFlow checkpoints\nThe Mesh TensorFlow checkpoints trained using the [T5 library][t5_github] can be\ndirectly loaded into T5X. For example, we can rerun the fine-tuning example\ninitializing from the MTF checkpoint by changing the `INIT_CHECKPOINT` Gin\nmacro.\n\n```sh\n# Model dir to save logs, ckpts, etc. 
in \"gs:\u002F\u002Fmodel_dir\" format.\nMODEL_DIR=\"...\"\n\n# Data dir to save the processed dataset in \"gs:\u002F\u002Fdata_dir\" format.\nTFDS_DATA_DIR=\"...\"\nT5X_DIR=\"...\"  # directory where the T5X repo is cloned.\n\npython3 ${T5X_DIR}\u002Ft5x\u002Ftrain.py \\\n  --gin_file=\"t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt19_ende_train.gin\" \\\n  --gin.MODEL_DIR=\\\"${MODEL_DIR}\\\" \\\n  --gin.MIXTURE_OR_TASK_NAME=\\\"wmt_t2t_ende_v003\\\" \\\n  --gin.INIT_CHECKPOINT=\\\"gs:\u002F\u002Ft5-data\u002Fpretrained_models\u002Ft5.1.1.base\u002Fmodel.ckpt-1000000\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\nNote that restoring directly from the Mesh TensorFlow checkpoints can be\ninefficient if heavy model parallelism is used for large models. This is\nbecause each host first loads an entire copy of the model and then keeps only\nthe relevant slices dictated by the model parallelism specification. If you have\nMesh TensorFlow checkpoints that you run often, we recommend converting the\ncheckpoints to T5X native format using the\n[convert_tf_checkpoint script](t5x\u002Fscripts\u002Fconvert_tf_checkpoint.py).\n\n\n## Citing T5X\nPlease use the following BibTeX entry to cite T5X.\n\n```\n@article{roberts2022t5x,\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.17189},\n  author = {Roberts, Adam and Chung, Hyung Won and Levskaya, Anselm and Mishra, Gaurav and Bradbury, James and Andor, Daniel and Narang, Sharan and Lester, Brian and Gaffney, Colin and Mohiuddin, Afroz and Hawthorne, Curtis and Lewkowycz, Aitor and Salcianu, Alex and van Zee, Marc and Austin, Jacob and Goodman, Sebastian and Soares, Livio Baldini and Hu, Haitang and Tsvyashchenko, Sasha and Chowdhery, Aakanksha and Bastings, Jasmijn and Bulian, Jannis and Garcia, Xavier and Ni, Jianmo and Chen, Andrew and Kenealy, Kathleen and Clark, Jonathan H. 
and Lee, Stephan and Garrette, Dan and Lee-Thorp, James and Raffel, Colin and Shazeer, Noam and Ritter, Marvin and Bosma, Maarten and Passos, Alexandre and Maitin-Shepard, Jeremy and Fiedel, Noah and Omernick, Mark and Saeta, Brennan and Sepassi, Ryan and Spiridonov, Alexander and Newlan, Joshua and Gesmundo, Andrea},\n  title = {Scaling Up Models and Data with $\\texttt{t5x}$ and $\\texttt{seqio}$},\n  journal={arXiv preprint arXiv:2203.17189},\n  year = {2022},\n}\n```\n\n\n## Note\nThis is not an officially supported Google product\n\n[t5_paper]: https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683\n[t5_github]: https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftext-to-text-transfer-transformer\n[gin-primer]: docs\u002Fusage\u002Fgin.md\n","# T5X\n\n*请前往 [T5X ReadTheDocs 文档页面](https:\u002F\u002Ft5x.readthedocs.io\u002F)。*\n\nT5X 是一个模块化、可组合且便于研究的框架，用于在多种规模下高效、可配置、自助式的序列模型（从语言模型开始）训练、评估和推理。\n\n它本质上是基于 [Mesh TensorFlow](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Fmesh) 的 [T5 代码库](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftext-to-text-transfer-transformer) 在 [JAX](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fjax) 和 [Flax](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fflax) 上的新版改进实现。欲了解更多信息，请参阅 [T5X 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.17189)。\n\n以下是使用 Google Cloud 上的 TPU 进行模型训练的快速入门指南。更多教程和背景信息，请参阅[完整文档](docs\u002Findex.md)。\n\n## 快速入门（推荐）\n\nT5X 可以通过 [XManager](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fxmanager) 在 [Vertex AI](https:\u002F\u002Fcloud.google.com\u002Fvertex-ai) 上运行。Vertex AI 是一个用于训练的平台，它可以创建 TPU 实例并在这些 TPU 上运行代码。当作业结束时，Vertex AI 还会自动关闭 TPU。这比手动管理 GCE 虚拟机和 TPU 虚拟机实例要方便得多。\n\n1. 按照先决条件和说明安装 [XManager](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fxmanager)。\n\n2. 
根据需要申请 TPU 配额。GCP 项目默认提供 8 个核心，这足以在单个 TPU 主机上运行一次训练实验。如果想要进行多主机训练或并行运行多个试验，则需要更多的配额。请导航至 [Quotas](https:\u002F\u002Fconsole.cloud.google.com\u002Fquotas)。\n\n您需要的配额如下：\n\n* 服务：`Vertex AI API`\n* 维度（位置）：`us-central1`\n* 如果您想运行单主机实验：\n    * `Custom model training TPU V2 cores per region`\n    * `Custom model training TPU V3 cores per region`\n* 如果您想运行多主机实验：\n    * `Custom model training TPU V2 pod cores per region`\n    * `Custom model training TPU V3 pod cores per region`\n\n提示：您无法使用多主机配额来运行单主机实验。（即，您不能使用 `TPU V2 pod` 来运行 `tpu_v2=8`）\n\n3. 启动位于 `t5x\u002Fscripts\u002Fxm_launch.py` 的 xmanager 脚本。\n\n作为示例，我们使用 WMT14 英德翻译任务，该任务在下面的示例部分中有更详细的介绍。\n\n```sh\nexport GOOGLE_CLOUD_BUCKET_NAME=...\nexport TFDS_DATA_DIR=gs:\u002F\u002F$GOOGLE_CLOUD_BUCKET_NAME\u002Ft5x\u002Fdata\nexport MODEL_DIR=gs:\u002F\u002F$GOOGLE_CLOUD_BUCKET_NAME\u002Ft5x\u002F$(date +%Y%m%d)\n\n# 在多主机实验中预先下载数据集。\ntfds build wmt_t2t_translate --data_dir=$TFDS_DATA_DIR\n\ngit clone https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\ncd .\u002Ft5x\u002F\n\npython3 .\u002Ft5x\u002Fscripts\u002Fxm_launch.py \\\n  --gin_file=t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_from_scratch.gin \\\n  --model_dir=$MODEL_DIR \\\n  --tfds_data_dir=$TFDS_DATA_DIR\n```\n\n请查看 `gs:\u002F\u002F$GOOGLE_CLOUD_BUCKET_NAME\u002Ft5x\u002F` 中的输出文件，这些文件可以被 TensorBoard 读取。\n\n## GPU 使用\n注意：NVIDIA 已发布此仓库的更新版本，支持 H100 FP8 并大幅提升了 GPU 性能。有关详细信息和使用说明，请访问 [NVIDIA Rosetta](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FJAX-Toolbox\u002Ftree\u002Fmain\u002Frosetta\u002Frosetta\u002Fprojects\u002Ft5x) 仓库。\n\nT5X 可以轻松地在单节点或使用 SLURM+pyxis 集群的多节点配置中运行于 GPU 上。更多说明请参阅 [t5x\u002Fcontrib\u002Fgpu](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\u002Fblob\u002Fmain\u002Ft5x\u002Fcontrib\u002Fgpu\u002FREADME.md)。`t5x\u002Fcontrib\u002Fgpu\u002Fscripts_gpu` 文件夹包含用于在 [The Pile](https:\u002F\u002Fpile.eleuther.ai\u002F) 上预训练 T5X，以及在 SQuAD 和 MNLI 上微调的示例脚本。这些脚本及相关的 `gin` 配置还包含了额外的 GPU 优化，以提高吞吐量。更多示例和说明可以在由 NVIDIA 维护的 
[NVIDIA Rosetta](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FJAX-Toolbox\u002Ftree\u002Fmain\u002Frosetta\u002Frosetta\u002Fprojects\u002Ft5x) 仓库中找到，该仓库支持 H100 FP8 并显著提升了 GPU 性能。\n\n\n## 安装\n\n请注意，除非另有说明，本文档中的所有命令都应在 TPU 虚拟机实例的命令行中执行。\n\n1. 按照\n    [说明](https:\u002F\u002Fcloud.google.com\u002Ftpu\u002Fdocs\u002Fjax-quickstart-tpu-vm#install_the_google_cloud_sdk)\n    设置 Google Cloud Platform (GCP) 帐户并启用 Cloud TPU API。\n\n    **注意:** T5X 也支持 GPU，如果您想使用 GPU 版本，请按照 [t5x\u002Fcontrib\u002Fgpu](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\u002Fblob\u002Fmain\u002Ft5x\u002Fcontrib\u002Fgpu\u002FREADME.md) 中的说明操作。\n\n2. 创建\n    [Cloud TPU 虚拟机实例](https:\u002F\u002Fcloud.google.com\u002Fblog\u002Fproducts\u002Fcompute\u002Fintroducing-cloud-tpu-vms)\n    按照\n    [此说明](https:\u002F\u002Fcloud.google.com\u002Ftpu\u002Fdocs\u002Fjax-quickstart-tpu-vm#create-vm)\n    进行操作。我们建议您先在单个 v3-8 TPU 上开发工作流（即，`--accelerator-type=v3-8`），待流程准备就绪后再扩展到 Pod 切片。在本 README 中，我们重点介绍如何使用单个 v3-8 TPU。有关 TPU 架构的更多信息，请参阅\n    [此处](https:\u002F\u002Fcloud.google.com\u002Ftpu\u002Fdocs\u002Fsystem-architecture-tpu-vm)。\n\n3. 使用 Cloud TPU 虚拟机时，您可以直接通过 SSH 登录到 TPU 虚拟机的宿主机。您可以在宿主机上安装软件包、运行代码等。一旦 TPU 实例创建完毕，即可通过以下命令登录：\n\n    ```sh\n    gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} --zone=${ZONE}\n    ```\n\n    其中 `TPU_NAME` 和 `ZONE` 是第 2 步中使用的名称和区域。\n\n4. 安装 T5X 和其依赖项。\n\n    ```sh\n    git clone --branch=main https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\n    cd t5x\n\n    python3 -m pip install -e '.[tpu]' -f \\\n      https:\u002F\u002Fstorage.googleapis.com\u002Fjax-releases\u002Flibtpu_releases.html\n\n    ```\n\n\n5. 创建 Google Cloud Storage (GCS) 存储桶，用于存储数据集和模型检查点。有关创建 GCS 存储桶的说明，请参阅这些\n    [instructions](https:\u002F\u002Fcloud.google.com\u002Fstorage\u002Fdocs\u002Fcreating-buckets)。\n\n6. 
（可选）如果您更喜欢 Jupyter\u002FColab 风格的环境，可以按照 [t5x\u002Fnotebooks](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\u002Fblob\u002Fmain\u002Ft5x\u002Fnotebooks\u002FREADME.md) 中的步骤设置自定义 Colab 运行环境。\n\n## 示例：英语到德语的翻译\n\n作为一个示例，我们使用 WMT14 英-德翻译数据集。原始数据集可以在 TensorFlow Datasets 中找到，其名称为\n[\"wmt_t2t_translate\"](https:\u002F\u002Fwww.tensorflow.org\u002Fdatasets\u002Fcatalog\u002Fwmt_t2t_translate)。\n\nT5 将诸如以下的翻译任务\n\n```py\n{'en': 'That is good.', 'de': 'Das ist gut.'}\n```\n\n转换为“文本到文本”的形式：\n\n```py\n{'inputs': 'translate English to German: That is good.', 'targets': 'Das ist gut.'}\n```\n\n这种表述方式使得多种不同的语言任务能够以统一的方式表达，并且单个编码器-解码器架构就可以在无需任何特定于任务的参数的情况下处理这些任务。更多细节请参阅 [T5 论文（Raffel 等人，2019）][t5_paper]。\n\n为了构建可扩展的数据管道和评估框架，我们使用了 [`SeqIO`](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fseqio)，它是从 [T5 库][t5_github] 中独立出来的。一个 `seqio.Task` 将原始数据集、词汇表、预处理（如分词）以及评估指标（如 [BLEU](https:\u002F\u002Faclanthology.org\u002FP02-1040.pdf)）封装在一起，并提供一个 [`tf.data`](https:\u002F\u002Fwww.tensorflow.org\u002Fguide\u002Fdata) 实例。\n\n[T5 库][t5_github] 提供了许多在 [T5 论文][t5_paper] 中使用过的 `seqio.Task`。在这个示例中，我们使用 [wmt_t2t_ende_v003](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftext-to-text-transfer-transformer\u002Fblob\u002Fd81c0bab2a41b4d5dfbe4971de32f7d67df65f31\u002Ft5\u002Fdata\u002Ftasks.py#L212)。\n\n在开始训练或微调之前，您需要先下载 [\"wmt_t2t_translate\"](https:\u002F\u002Fwww.tensorflow.org\u002Fdatasets\u002Fcatalog\u002Fwmt_t2t_translate) 数据集。\n\n```sh\n# 用于保存处理后数据集的数据目录，格式为 \"gs:\u002F\u002Fdata_dir\"。\nTFDS_DATA_DIR=\"...\"\n\n# 确保数据集包是最新的。\npython3 -m pip install --upgrade tfds-nightly\n\n# 预先下载数据集，并通过 --data_dir 指定输出目录。\ntfds build wmt_t2t_translate --data_dir=${TFDS_DATA_DIR}\n```\n\n### 训练\n\n要运行训练任务，我们使用 `t5x\u002Ftrain.py` 脚本。\n\n```sh\n# 用于保存日志、检查点等的模型目录，格式为 \"gs:\u002F\u002Fmodel_dir\"。\nMODEL_DIR=\"...\"\nT5X_DIR=\"...\"  # T5X 仓库被克隆的目录。\nTFDS_DATA_DIR=\"...\"\n\npython3 ${T5X_DIR}\u002Ft5x\u002Ftrain.py \\\n  --gin_file=\"t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_from_scratch.gin\" \\\n  
--gin.MODEL_DIR=\\\"${MODEL_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n本次训练的配置定义在 Gin 文件 [base_wmt_from_scratch.gin](t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_from_scratch.gin) 中。[Gin-config](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fgin-config) 是一个基于依赖注入来管理配置的库。Gin 的诸多优势之一是允许用户向 T5X 库传递自定义组件（例如自定义模型），而无需修改核心库。关于如何实现这一点，请参阅“自定义组件”部分。\n\n尽管核心库与 Gin 无关，但它是我们提供的示例的核心。因此，我们在 T5X 的上下文中提供了一个简短的 [Gin 入门介绍][gin-primer]。所有配置都写入 `MODEL_DIR` 中的一个文件 `config.gin`，这使得调试和实验复现变得更加容易。\n\n除了 `config.gin` 外，`model-info.txt` 文件还总结了模型参数（形状、轴名称、分区信息）以及优化器状态。\n\n\n\n#### TensorBoard\n\n要在 [TensorBoard](https:\u002F\u002Fwww.tensorflow.org\u002Ftensorboard) 中监控训练过程，由于身份验证问题，最好在您自己的机器上启动 TensorBoard，而不是在 TPU 虚拟机中启动。因此，请在您自己机器的命令行中（而不是在通过 SSH 登录的 TPU 虚拟机中）启动 TensorBoard，并将 `logdir` 指向 `MODEL_DIR`。\n\n```sh\n# 注意：请在您的机器上运行，而不是在 TPU 虚拟机上！\nMODEL_DIR=\"...\"  # 从 TPU 虚拟机复制过来。\ntensorboard --logdir=${MODEL_DIR}\n```\n\n或者，您也可以在 Colab 中启动 TensorBoard。在 Colab 单元格中运行\n\n```python\nfrom google.colab import auth\nauth.authenticate_user()\n```\n\n以授权 Colab 访问 GCS 存储桶，然后启动 TensorBoard。\n\n```python\n%load_ext tensorboard\nmodel_dir = \"...\"  # 从 TPU 虚拟机复制过来。\n# 使用 $ 前缀，让 IPython 将 Python 变量展开为实际路径。\n%tensorboard --logdir=$model_dir\n```\n\n\n### 微调\n\n我们可以利用自监督预训练的优势，从我们预训练好的模型之一进行初始化。这里我们使用 T5.1.1 Base 检查点。\n\n```sh\n# 用于保存日志、检查点等的模型目录，格式为 \"gs:\u002F\u002Fmodel_dir\"。\nMODEL_DIR=\"...\"\n\n# 用于保存处理后数据集的数据目录，格式为 \"gs:\u002F\u002Fdata_dir\"。\nTFDS_DATA_DIR=\"...\"\nT5X_DIR=\"...\"  # T5X 仓库被克隆的目录。\n\npython3 ${T5X_DIR}\u002Ft5x\u002Ftrain.py \\\n  --gin_file=\"t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_finetune.gin\" \\\n  --gin.MODEL_DIR=\\\"${MODEL_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n**注意：** 当通过标志传递字符串、字典、列表、元组值或 Bash 变量时，必须将其用引号括起来。对于字符串，需要使用转义引号 (`\\\"\u003Cstring>\\\"`)。例如：\n`--gin.utils.DatasetConfig.split=\\\"validation\\\"` 或\n`--gin.MODEL_DIR=\\\"${MODEL_DIR}\\\"`。\n\nGin 使更改许多配置变得非常容易。例如，您可以更改 `partitioning.PjitPartitioner.num_partitions`（覆盖 
[base_wmt_from_scratch.gin](t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_from_scratch.gin) 中的值），以调整并行策略，并将其作为命令行参数传递。\n\n```sh\n--gin.partitioning.PjitPartitioner.num_partitions=8\n```\n\n### 评估\n\n要进行离线（即不涉及训练的）评估，可以使用 `t5x\u002Feval.py` 脚本。\n\n```sh\nEVAL_OUTPUT_DIR=\"...\"  # 用于写入评估结果的目录\nT5X_DIR=\"...\"  # t5x 被克隆的目录，例如 `${HOME}\"\u002Ft5x\"`.\nTFDS_DATA_DIR=\"...\"\nCHECKPOINT_PATH=\"...\"\n\npython3 ${T5X_DIR}\u002Ft5x\u002Feval.py \\\n  --gin_file=\"t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_eval.gin\" \\\n  --gin.CHECKPOINT_PATH=\\\"${CHECKPOINT_PATH}\\\" \\\n  --gin.EVAL_OUTPUT_DIR=\\\"${EVAL_OUTPUT_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n\n### 推理\n\n要进行推理，可以使用 `t5x\u002Finfer.py` 脚本。在这里，我们使用相同的 `seqio.Task`，但在推理过程中，我们不会使用目标特征，而是将它们与预测结果一起记录在一个 JSON 文件中。\n\n```sh\nINFER_OUTPUT_DIR=\"...\"  # 用于写入推理结果的目录\nT5X_DIR=\"...\"  # t5x 被克隆的目录，例如 `${HOME}\"\u002Ft5x\"`.\nTFDS_DATA_DIR=\"...\"\nCHECKPOINT_PATH=\"...\"\n\npython3 ${T5X_DIR}\u002Ft5x\u002Finfer.py \\\n  --gin_file=\"t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_infer.gin\" \\\n  --gin.CHECKPOINT_PATH=\\\"${CHECKPOINT_PATH}\\\" \\\n  --gin.INFER_OUTPUT_DIR=\\\"${INFER_OUTPUT_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n### 导出为 TensorFlow Saved Model\n\n预训练模型可以导出为 TensorFlow Saved Model，并使用 [优化的 TensorFlow 运行时]\n(https:\u002F\u002Fcloud.google.com\u002Fvertex-ai\u002Fdocs\u002Fpredictions\u002Foptimized-tensorflow-runtime) 部署到 Vertex AI Prediction 服务。请注意，导出的模型无法在基于开源的\n[TensorFlow Model Server](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Fserving) 上运行。\n\n```sh\nT5X_DIR=\"...\"  # t5x 被克隆的目录，例如 ${HOME}\"\u002Ft5x\"。\nCHECKPOINT_PATH=\"...\"\n\nBATCH_SIZE=None\nBEAM_SIZE=1\n\n# 如果计划在 NVIDIA A100 或更新的 GPU 上运行导出的模型，请使用 'bfloat16'；\n# 对于其他 GPU，则使用 'float32'。\nACTIVATION_DTYPE=bfloat16\n\n# 版本号必须是数字。我们根据当前日期和时间生成一个版本号。\nVERSION=$(date +%Y%m%d%H%M%S)\n\nNAME=t5x_base_${ACTIVATION_DTYPE}  # 模型名称。\n\n# 
模型导出路径。请注意，导出脚本会在模型名称后添加 _cpu 后缀。\nOUTPUT=${CHECKPOINT_PATH}\u002Fsaved_model.${NAME}\u002F${VERSION}\n\ndeclare -a ARGS=(\n--gin_file=t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fbase.gin\n--gin_file=t5x\u002Ft5x\u002Fconfigs\u002Fruns\u002Fexport.gin\n--gin.TASK_FEATURE_LENGTHS=\"{'inputs': 256, 'targets': 256}\"\n--gin.CHECKPOINT_PATH=\\\"${CHECKPOINT_PATH}\\\"\n--gin.MODEL_NAME=\\\"\u002Fml\u002F${USER}\u002Ft5x_base\\\"\n--gin.MODEL_OUTPUT_DIR=\\\"${OUTPUT}\\\"\n--gin.BEAM_SIZE=${BEAM_SIZE}\n--gin.BATCH_SIZE=${BATCH_SIZE}\n--gin.export_lib.save.partitioner=None\n--gin.export_lib.save.warmup_examples=\"['hello world']\"\n--gin.export_lib.ExportableModule.use_batch_function=False\n--gin.export_lib.ExportableModule.use_gpu=False\n--gin.export_lib.ExportableModule.jit_compile=False\n--gin.ACTIVATION_DTYPE=\\\"${ACTIVATION_DTYPE}\\\"\n--gin.network.T5Config.dtype=\\\"${ACTIVATION_DTYPE}\\\"\n--gin.utils.RestoreCheckpointConfig.dtype=\\\"${ACTIVATION_DTYPE}\\\"\n--gin.DROPOUT_RATE=0.0\n)\n\n(python3 ${T5X_DIR}\u002Ft5x\u002Fexport.py \"${ARGS[@]}\")\n```\n\n有关详细参数定义，请参阅 [export.gin] (t5x\u002Fconfigs\u002Fruns\u002Fexport.gin)。\n\n您可以在 NVIDIA A100 40GB 上运行 XL 及更小的模型，在 NVIDIA A100 80GB 上运行 XXL 模型。\n\n## 自定义组件\n\n[翻译示例] (#example-english-to-german-translation) 使用了 T5X 提供的编码器-解码器模型以及 T5 库中的数据集。本节将展示如何使用您自己的数据集和模型，并通过 Gin 配置文件进行传递。\n\n### 示例：用户目录中的自定义数据集\n\n对于此示例，我们具有以下目录结构，其中 `${HOME}\u002Fdir1\u002Fuser_dir` 表示包含自定义组件的用户目录。\n\n```\n${HOME}\n└── dir1\n    └── user_dir\n        ├── t5_1_1_base_de_en.gin\n        └── tasks.py\n```\n\n作为示例，让我们定义一个新的数据集。这里我们使用相同的 Translation 数据集，但我们将翻译任务的方向设置为相反方向，即从德语到英语，而不是从英语到德语。我们在 `tasks.py` 中定义此任务：\n\n```py\n# ${HOME}\u002Fdir1\u002Fuser_dir\u002Ftasks.py\n\nimport functools\nimport seqio\nimport tensorflow_datasets as tfds\nfrom t5.evaluation import metrics\nfrom t5.data import preprocessors\n\nvocabulary = seqio.SentencePieceVocabulary(\n    'gs:\u002F\u002Ft5-data\u002Fvocabs\u002Fcc_all.32000\u002Fsentencepiece.model', 
extra_ids=100)\noutput_features = {\n    'inputs': seqio.Feature(vocabulary=vocabulary),\n    'targets': seqio.Feature(vocabulary=vocabulary)\n}\n\nseqio.TaskRegistry.add(\n    'wmt_t2t_de_en_v003',\n    source=seqio.TfdsDataSource(tfds_name='wmt_t2t_translate\u002Fde-en:1.0.0'),\n    preprocessors=[\n        functools.partial(\n            preprocessors.translate,\n            source_language='de', target_language='en'),\n        seqio.preprocessors.tokenize,\n        seqio.CacheDatasetPlaceholder(),\n        seqio.preprocessors.append_eos_after_trim,\n    ],\n    metric_fns=[metrics.bleu],\n    output_features=output_features)\n```\n\n在 Gin 文件中，大多数设置与 [英德翻译示例] (#example-english-to-german-translation) 中使用的设置相同。因此，我们直接引用该示例中的 Gin 文件。为了使用我们刚刚定义的“wmt_t2t_de_en_v003”任务，我们需要导入任务模块“tasks.py”。请注意，我们使用相对于用户目录的相对路径。这将作为标志指定。\n\n```py\n# ${HOME}\u002Fdir1\u002Fuser_dir\u002Ft5_1_1_base_de_en.gin\nfrom __gin__ import dynamic_registration\nimport tasks  # 导入 dir1\u002Fuser_dir\u002Ftasks.py 中定义的任务。\n\ninclude \"t5x-tmp\u002Ft5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_from_scratch.gin\"\nMIXTURE_OR_TASK_NAME = \"wmt_t2t_de_en_v003\"\n```\n\n最后，我们通过将用户目录作为 `gin_search_paths` 标志来启动训练，以便 Gin 文件和 Python 模块可以使用相对路径进行指定。\n\n```sh\nPROJECT_DIR=${HOME}\"\u002Fdir1\u002Fuser_dir\"\nT5X_DIR=\"...\"  # t5x 被克隆的目录。\nTFDS_DATA_DIR=\"...\"\nMODEL_DIR=\"...\"\nexport PYTHONPATH=${PROJECT_DIR}\n\npython3 ${T5X_DIR}\u002Ft5x\u002Ftrain.py \\\n  --gin_search_paths=${PROJECT_DIR} \\\n  --gin_file=\"t5_1_1_base_de_en.gin\" \\\n  --gin.MODEL_DIR=\\\"${MODEL_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n## 检查点\n\n### 原生检查点\n\n我们已发布许多原始 T5 模型及其变体的原生 T5X 格式检查点，以实现最高效率。请参阅 [完整列表] (https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\u002Fblob\u002Fmain\u002Fdocs\u002Fmodels.md)，其中包括匹配的 Gin 配置文件。\n\n这些检查点是从公开的 [Mesh TensorFlow 检查点] 
(https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftext-to-text-transfer-transformer\u002Fblob\u002Fmain\u002Freleased_checkpoints.md#t511) 转换而来的。\n\n### 与 Mesh TensorFlow 检查点的兼容性\n\n使用 [T5 库][t5_github] 训练的 Mesh TensorFlow 检查点可以直接加载到 T5X 中。例如，我们可以通过更改 `INIT_CHECKPOINT` Gin 宏，从 MTF 检查点初始化并重新运行微调示例。\n\n```sh\n# 用于保存日志、检查点等的模型目录，格式为“gs:\u002F\u002Fmodel_dir”。\nMODEL_DIR=\"...\"\n\n# 用于保存处理后的数据集的目录，格式为“gs:\u002F\u002Fdata_dir”。\nTFDS_DATA_DIR=\"...\"\nT5X_DIR=\"...\"  # t5x 代码库被克隆的目录。\n\npython3 ${T5X_DIR}\u002Ft5x\u002Ftrain.py \\\n  --gin_file=\"t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt19_ende_train.gin\" \\\n  --gin.MODEL_DIR=\\\"${MODEL_DIR}\\\" \\\n  --gin.MIXTURE_OR_TASK_NAME=\\\"wmt_t2t_ende_v003\\\" \\\n  --gin.INIT_CHECKPOINT=\\\"gs:\u002F\u002Ft5-data\u002Fpretrained_models\u002Ft5.1.1.base\u002Fmodel.ckpt-1000000\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n请注意，如果大型模型使用了高并行度，直接从 Mesh TensorFlow 检查点恢复可能会效率较低。这是因为每个主机首先会加载整个模型副本，然后仅保留由模型并行度规范决定的相关切片。如果您经常使用 Mesh TensorFlow 检查点，我们建议使用 [convert_tf_checkpoint 脚本](t5x\u002Fscripts\u002Fconvert_tf_checkpoint.py) 将检查点转换为 T5X 原生格式。\n\n## 引用 T5X\n请使用以下 BibTeX 条目来引用 T5X。\n\n```\n@article{roberts2022t5x,\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.17189},\n  author = {Roberts, Adam and Chung, Hyung Won and Levskaya, Anselm and Mishra, Gaurav and Bradbury, James and Andor, Daniel and Narang, Sharan and Lester, Brian and Gaffney, Colin and Mohiuddin, Afroz and Hawthorne, Curtis and Lewkowycz, Aitor and Salcianu, Alex and van Zee, Marc and Austin, Jacob and Goodman, Sebastian and Soares, Livio Baldini and Hu, Haitang and Tsvyashchenko, Sasha and Chowdhery, Aakanksha and Bastings, Jasmijn and Bulian, Jannis and Garcia, Xavier and Ni, Jianmo and Chen, Andrew and Kenealy, Kathleen and Clark, Jonathan H. 
and Lee, Stephan and Garrette, Dan and Lee-Thorp, James and Raffel, Colin and Shazeer, Noam and Ritter, Marvin and Bosma, Maarten and Passos, Alexandre and Maitin-Shepard, Jeremy and Fiedel, Noah and Omernick, Mark and Saeta, Brennan and Sepassi, Ryan and Spiridonov, Alexander and Newlan, Joshua and Gesmundo, Andrea},\n  title = {Scaling Up Models and Data with $\\texttt{t5x}$ and $\\texttt{seqio}$},\n  journal={arXiv preprint arXiv:2203.17189},\n  year = {2022},\n}\n```\n\n\n## 注意\n这并非 Google 官方支持的产品。\n\n[t5_paper]: https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683\n[t5_github]: https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftext-to-text-transfer-transformer\n[gin-primer]: docs\u002Fusage\u002Fgin.md","# T5X 快速上手指南\n\nT5X 是一个基于 JAX 和 Flax 构建的模块化、高性能序列模型框架，适用于大规模语言模型的训练、评估和推理。它是原始 T5 代码库（基于 Mesh TensorFlow）的现代化重构版本。\n\n## 环境准备\n\n### 系统要求\n*   **硬件**: 推荐使用 Google Cloud TPU (v3-8 或更高) 或 NVIDIA GPU (需参考 NVIDIA Rosetta 仓库以获取 H100 FP8 支持)。\n*   **操作系统**: Linux (TPU VM 或 GPU 服务器)。\n*   **云平台**: 若使用 TPU，需拥有 Google Cloud Platform (GCP) 账号并启用 Cloud TPU API。\n*   **存储**: 需要 Google Cloud Storage (GCS) Bucket 用于存放数据集和模型检查点。\n\n### 前置依赖\n*   Python 3.8+\n*   Git\n*   GCP SDK (仅 TPU 用户需要)\n*   XManager (可选，推荐用于 Vertex AI 自动化部署)\n\n> **注意**: 国内开发者若无法直接访问 GCP 或 HuggingFace，请自行配置网络代理或使用相应的镜像加速服务。\n\n## 安装步骤\n\n以下命令需在 TPU VM 实例或配置好的 Linux 环境中执行。\n\n1.  **克隆仓库**\n    ```sh\n    git clone --branch=main https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\n    cd t5x\n    ```\n\n2.  **安装依赖包**\n    \n    *   **TPU 环境**:\n        ```sh\n        python3 -m pip install -e '.[tpu]' -f \\\n          https:\u002F\u002Fstorage.googleapis.com\u002Fjax-releases\u002Flibtpu_releases.html\n        ```\n    \n    *   **GPU 环境**:\n        请参考 `t5x\u002Fcontrib\u002Fgpu\u002FREADME.md` 或 NVIDIA Rosetta 仓库进行特定安装。\n\n3.  
**准备数据存储**\n    创建一个 GCS Bucket 用于存储数据和模型输出：\n    ```sh\n    # 替换为您的 bucket 名称\n    export GOOGLE_CLOUD_BUCKET_NAME=\"your-bucket-name\"\n    export TFDS_DATA_DIR=gs:\u002F\u002F$GOOGLE_CLOUD_BUCKET_NAME\u002Ft5x\u002Fdata\n    export MODEL_DIR=gs:\u002F\u002F$GOOGLE_CLOUD_BUCKET_NAME\u002Ft5x\u002Fmodels\u002F$(date +%Y%m%d)\n    ```\n\n4.  **预处理数据集 (示例：WMT14 英德翻译)**\n    ```sh\n    # 确保 tfds 包为最新版本\n    python3 -m pip install --upgrade tfds-nightly\n\n    # 下载并构建数据集\n    tfds build wmt_t2t_translate --data_dir=$TFDS_DATA_DIR\n    ```\n\n## 基本使用\n\nT5X 使用 Gin 配置文件来管理实验参数。以下是最简单的从头开始训练示例。\n\n### 1. 启动训练任务\n\n使用 `t5x\u002Ftrain.py` 脚本启动训练。以下示例使用 WMT14 数据集从头训练一个基础模型：\n\n```sh\npython3 .\u002Ft5x\u002Ftrain.py \\\n  --gin_file=t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_from_scratch.gin \\\n  --gin.MODEL_DIR=\\\"${MODEL_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n*   `--gin_file`: 指定配置文件路径，定义了模型架构、超参数等。\n*   `--gin.MODEL_DIR`: 指定模型日志和检查点的保存路径（注意字符串需转义引号）。\n*   `--tfds_data_dir`: 指定预处理后的数据路径。\n\n### 2. 监控训练进度 (TensorBoard)\n\n建议在**本地机器**（而非 TPU VM）上运行 TensorBoard 以避免认证问题，前提是本地已配置好 GCP 凭证。\n\n```sh\n# 在本地终端执行，将 MODEL_DIR 替换为实际的 GCS 路径\ntensorboard --logdir=${MODEL_DIR}\n```\n\n或者在 Colab 中运行：\n```python\nfrom google.colab import auth\nauth.authenticate_user()\n\n%load_ext tensorboard\nmodel_dir = \"gs:\u002F\u002Fyour-bucket-name\u002Ft5x\u002Fmodels\u002F...\" # 替换为实际路径\n# 使用 $ 前缀，让 IPython 将 Python 变量展开为实际路径。\n%tensorboard --logdir=$model_dir\n```\n\n### 3. 微调预训练模型 (可选)\n\n利用自监督预训练优势，加载现有检查点进行微调：\n\n```sh\npython3 .\u002Ft5x\u002Ftrain.py \\\n  --gin_file=t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_finetune.gin \\\n  --gin.MODEL_DIR=\\\"${MODEL_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n### 4. 
离线评估\n\n训练完成后，使用 `eval.py` 对模型进行评估：\n\n```sh\nCHECKPOINT_PATH=\"gs:\u002F\u002Fpath\u002Fto\u002Fcheckpoint\"\nEVAL_OUTPUT_DIR=\"gs:\u002F\u002Fpath\u002Fto\u002Foutput\"\n\npython3 .\u002Ft5x\u002Feval.py \\\n  --gin_file=t5x\u002Fexamples\u002Ft5\u002Ft5_1_1\u002Fexamples\u002Fbase_wmt_eval.gin \\\n  --gin.CHECKPOINT_PATH=\\\"${CHECKPOINT_PATH}\\\" \\\n  --gin.EVAL_OUTPUT_DIR=\\\"${EVAL_OUTPUT_DIR}\\\" \\\n  --tfds_data_dir=${TFDS_DATA_DIR}\n```\n\n> **提示**: 若需修改并行策略或批次大小等参数，可通过命令行直接覆盖 Gin 配置，例如：\n> `--gin.partitioning.PjitPartitioner.num_partitions=8`","某跨国电商团队急需构建一个支持英、德、法三语的高精度商品评论翻译系统，以应对黑五期间的全球流量高峰。\n\n### 没有 t5x 时\n- **训练效率低下**：基于旧版 TensorFlow 的代码难以充分利用云端的 TPU 集群资源，大规模模型训练耗时数周且经常因显存溢出中断。\n- **实验迭代缓慢**：修改模型架构或调整超参数需要重写大量底层代码，研究人员无法快速验证新的算法想法。\n- **部署维护困难**：训练环境与推理环境不一致，导致模型上线时出现兼容性报错，运维团队需花费大量时间手动对齐配置。\n- **扩展性受限**：想要从单语言模型扩展到多语言联合训练时，原有框架缺乏模块化设计，代码耦合度极高，几乎需要重构。\n\n### 使用 t5x 后\n- **性能显著提升**：t5x 基于 JAX 和 Flax 重构，天然支持高性能 TPU 并行计算，将原本数周的训练周期缩短至几天，且运行稳定。\n- **研发敏捷高效**：借助其模块化与可组合特性，团队仅需修改少量 Gin 配置文件即可切换模型结构或数据集，极大加速了实验迭代。\n- **全链路一致性**：t5x 提供统一的训练、评估和推理接口，消除了环境差异，实现了从实验到生产环境的无缝平滑迁移。\n- **灵活弹性扩展**：利用其自服务架构，团队轻松实现了多语言混合训练任务，并能根据业务需求在单节点 GPU 或多主机 TPU 集群间自由切换。\n\nt5x 通过高性能的 JAX 实现与模块化设计，让大规模序列模型的研发布局从“艰难运维”转变为“专注创新”。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_t5x_8b6b5ccd.png","google-research","Google Research","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fgoogle-research_c23b2adf.png","","https:\u002F\u002Fresearch.google","https:\u002F\u002Fgithub.com\u002Fgoogle-research",[82,86,90,94],{"name":83,"color":84,"percentage":85},"Python","#3572A5",89.8,{"name":87,"color":88,"percentage":89},"Jupyter Notebook","#DA5B0B",9,{"name":91,"color":92,"percentage":93},"Shell","#89e051",1.1,{"name":95,"color":96,"percentage":97},"Dockerfile","#384d54",0,2961,341,"2026-04-04T18:43:36","Apache-2.0",5,"Linux","非必需（主要推荐 TPU）。若使用 GPU，需 NVIDIA 显卡（文档提及 H100 支持 FP8），需配合 SLURM+pyxis 集群或多节点配置。具体显存和 CUDA 版本未在本文档明确说明，建议参考 NVIDIA Rosetta 
仓库。","未说明",{"notes":107,"python":108,"dependencies":109},"该工具主要基于 Google Cloud TPU VM 环境设计（推荐 v3-8 或更高），核心依赖 JAX 和 Flax 而非 PyTorch。若需在 GPU 上运行以获得最佳性能（特别是 H100），请参考 NVIDIA 维护的 Rosetta 仓库。数据存储和检查点保存需要配置 Google Cloud Storage (GCS) 桶。","3.8+",[110,111,112,113,114,115,116],"jax","flax","tensorstore","seqio","tensorflow-datasets (tfds)","gin-config","xmanager (可选，用于 Vertex AI)",[26,13],"2026-03-27T02:49:30.150509","2026-04-06T06:53:13.469256",[121,126,131,136,140,144],{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},17269,"保存检查点（checkpoint）后出现段错误（Segmentation Fault）怎么办？","这是一个已知问题，通常由 tensorstore 库的特定版本引起。解决方案是将 tensorstore 升级至 0.1.20 或更高版本。如果无法升级，有用户反馈降级到 0.1.14 版本也能解决该问题（特别是在 TPU v3\u002Fv4 环境下）。\n相关命令：\npip install tensorstore==0.1.20\n或者\npip install tensorstore==0.1.14","https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\u002Fissues\u002F366",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},17270,"在 Apple M1 芯片上加载模型检查点时遇到 'This event loop is already running' 错误如何解决？","该错误通常与文件路径格式有关。尝试将配置文件中的相对路径更改为绝对路径（full path）即可解决此问题。确保在加载检查点时提供完整的文件系统路径，而不是相对于当前目录的路径。","https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\u002Fissues\u002F446",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},17271,"在 TPU Pod 上训练 t5x-XXL 模型时出现 'Fatal Python error: Segmentation fault' 是什么原因？","这与 tensorstore 库的一个已知 Bug 有关（参考 google\u002Ftensorstore#30）。维护者确认该问题已在 tensorstore 0.1.20 版本中修复。请检查您的环境并升级 tensorstore：\npip install --upgrade tensorstore","https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\u002Fissues\u002F340",{"id":137,"question_zh":138,"answer_zh":139,"source_url":130},17272,"使用相对路径保存或加载检查点时报错 'Invalid key' 或 'Error reading local file' 怎么处理？","这是由相对路径解析问题引起的。解决方法是将所有检查点路径配置修改为绝对路径。例如，将 '.\u002Fpretrain_model\u002Fcheckpoint_...' 
改为 '\u002Ffull\u002Fpath\u002Fto\u002Fpretrain_model\u002Fcheckpoint_...'。多位用户确认改用绝对路径后问题消失。",{"id":141,"question_zh":142,"answer_zh":143,"source_url":125},17273,"如何确认当前的段错误是否由 tensorstore 版本引起？","可以通过检查当前安装的 tensorstore 版本来确认。如果版本低于 0.1.20（如 0.1.17, 0.1.18, 0.1.19），则极有可能是该库导致的段错误。特别是在保存检查点时随机崩溃的情况。建议使用 'pip show tensorstore' 查看版本，并升级到 0.1.20 进行测试。",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},17274,"在单节点多 GPU（如 8xA100）环境下运行预训练时遇到分区错误怎么办？","虽然具体报错信息在提供的数据中被截断，但此类问题通常与 `num_partitions` 配置或 `logical_axis_rules` 设置不当有关。在 GPU 环境下，需确保 `partitioning.PjitPartitioner` 的配置与硬件拓扑匹配。如果使用的是官方示例脚本，请检查 `activation_partitioning_dims` 和 `parameter_partitioning_dims` 的设置是否符合 GPU 显存限制。对于纯 GPU 环境，有时需要将 `num_partitions` 设置为 1 并调整逻辑轴规则以避免跨设备通信错误。","https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ft5x\u002Fissues\u002F410",[]]